The first instalment of this series [Neri19] introduced the minverse algorithm and a C++ library called qmodular [qmodular] that implements it. We have seen that in certain cases minverse performs better than the algorithm currently implemented by major compilers when evaluating expressions of the form n % d == r
. Unfortunately, minverse does not work when ==
is replaced by <
, <=
, >
or >=
. This article presents two algorithms that do not have this restriction, namely, Multiply and Shift (mshift) and Multiply and Compare (mcomp). They are very similar to one another and also to the Remainder by Multiplication and Shifting Right (RMSR) covered in Hacker’s Delight [Warren13].
It is important to remember that the intention here is not to ‘beat’ the compiler. On the contrary, this series is an open letter addressed to compiler writers presenting some algorithms that, potentially, could be incorporated into their product for the benefit of all programmers. Performance analysis shows that the alternatives discussed in this series are often faster than builtin implementations.
Recall and warm up
We start by looking at Figure 1^{1} which displays the time taken to check whether each element of an array of 65,536 uniformly distributed unsigned 32bits dividends in the interval [0, 10^{6}] leaves remainder 3 when divided by 10. Bars built_in and minverse correspond to algorithms presented in [Neri19]. (Builtin is simply^{2} n % 10 == 3
.) Bars mshift and mcomp refer to algorithms covered by this article. A simple variant of each, mshift_promoted and mcomp_promoted, are also shown.
Figure 1 
All measurements include the time to scan the array of dividends which is used as unit of time. Timings^{3} are 2.14 for builtin, 1.71 for minverse, 1.47 for mshift and mcomp and 1.30 for promoted variants. Subtracting the scanning time and taking results relatively to builtin’s yields 0.71/1.14 ≈ 0.62 for minverse, 0.47/1.14 ≈ 0.41 for mshift and mcomp and 0.30/1.14 ≈ 0.26 for promoted algorithms. These numbers, however, depend on the divisor as will be made clear later on.
Listing 1 contrasts the code generated by GCC 8.2.1 with O3
for some of these algorithms.
built_in 0: mov %edi,%eax 2: mov $0xcccccccd,%edx 7: mul %edx 9: shr $0x3,%edx c: lea (%rdx,%rdx,4),%eax f: add %eax,%eax 11: sub %eax,%edi 13: cmp $0x3,%edi 16: sete %al 19: retq minverse 0: sub $0x3,%edi 3: imul $0xcccccccd,%edi,%edi 9: ror %edi b: cmp $0x19999999,%edi 11: setbe %al 14: retq mshift 0: imul $0x1999999a,%edi,%edi 6: shr $0x1c,%edi 9: cmp $0x4,%edi c: sete %al f: retq mcomp 0: sub $0x3,%edi 3: imul $0x1999999a,%edi,%edi 9: cmp $0x1999999a,%edi f: setb %al 12: retq 
Listing 1 
Recall that we are concerned with the evaluation of modular expressions where the divisor is a compile time constant and the dividend is a runtime variable. The remainder can be either. They all have the same unsigned integer type which implements modulus 2w arithmetic. (Typically, w = 32 or w = 64.) We focuses on GCC 8.2.1 for x86_64 target but some ideas here might also apply to other platforms.
The ‘Remainder by Multiplication and Shifting Right’ algorithm
This section loosely follows [Warren13] and covers the basics of RMSR by means of an example. We shall see that RMSR, mshift and mcomp have the same underlying idea.
Let the divisor be d = 10. By Euclidean division, any integer n can be uniquely written as n = 10 ∙ q + r, where q and r are integers, r ∈ [0, 9]. Written in code, q = n / 10
and r = n % 10
. Using this expression we obtain
⌊16 ∙ (n / 10)⌋ = ⌊16 ∙ ((10 ∙ q + r) / 10)⌋ = ⌊16 ∙ q + (16 ∙ r) / 10)⌋
= 16 ∙ q + ⌊16 ∙ (r / 10)⌋.
It follows that ⌊16 ∙ (n / 10)⌋ ≡ ⌊16 ∙ (r / 10)⌋ (mod 16).
[Warren13] observes that for r = 0, 1, 2, 3, 4, 5, 6, 7, 8 and 9, the quantity ⌊16 ∙ (r / 10)⌋ takes values on 0, 1, 3, 4, 6, 8, 9, 11, 12 and 14, respectively. Since the latter numbers are all distinct they can be mapped back to the former. Therefore, we efficiently obtain r provided we can reasonably evaluate:
 ⌊16 ∙ (n / 10)⌋ mod 16; and
 the inverse of φ defined by φ(r) = ⌊16 ∙ (r / 10)⌋, for all r ∈ {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}.
We consider the evaluation of ⌊16 ∙ (n / 10)⌋ mod 16 by looking at the binary representation of intermediate results, namely, x_{1} = n / 10, x_{2} = 16 ∙ x_{1}, x_{3} = ⌊x_{2}⌋ and x_{4} = x_{3} mod 16. To get x_{2} we multiply x_{1} by 16, which shifts bits 4 positions to the left. In particular, the first 4 bits of x_{1} on the right of the binary point move to its left. To get x_{3}, we apply the integer part function discarding fractional bits. Finally, to get x_{4} we take modulo 16 zeroing all but the 4 leftmost bits. For example, for n = 98 we have (relevant bits are emphasised):
x_{1} = 98 / 10 = 9.8 = (1001.11001100…)_{2},
x_{2} = 16 ∙ x_{1} = 156.8 = (10011100.1100…)_{2},
x_{3} = ⌊x_{2}⌋ = 156 = (10011100)_{2},
x_{4} = x3 mod 16 = 12 = (1100)_{2}.
The ellipses above serve as reminders that those equalities hold in the beautiful world of infinite precision which our CPUs do not belong. Rounding is necessary and error is unavoidable. Caution is required to avoid trashing the bits that we are interested in. Hence, we look for an approximation of x_{1} = n / 10 known to be correct at least up to 4 bits after the binary point or, equivalently, an approximation of x_{2} = (2^{4} / 10) ∙ n correct, at least, on its integer part. There is an efficient way to tackle this problem. It is a classical idea widely used by compilers to evaluate division by compile time constants. For the sake of concreteness, we restraint this discussion to 32bits CPUs.
We have x_{2} = (2^{4} / 10) ∙ n = (2^{32} / 10) ∙ n / 2^{28}. Here the rounding comes: number (2^{32} / 10) = 429,496,729.6 is rounded up to M = 429,496,730 ([Warren13] rounds it down) and we get x_{2} ≈ M ∙ n / 2^{28}. The error of this approximation, (M  (2^{32} / 10)) ∙ n / 2^{28}, grows with n and thus, for n large enough, it becomes such that x_{2} and M ∙ n / 2^{28} do not have the same integer part. For such values x_{3} ≠ ⌊M ∙ n / 2^{28}⌋ and the algorithm breaks. For instance, for n = 2^{27}, we have x_{2} = (2^{4} / 10) ∙ 2^{27} = 2^{31} / 10 = 214,748,364.8 whereas M ∙ n / 2^{28} = M ∙ 2^{27} / 2^{28} = M / 2 = 214,748,365.
Despite the last issue, there exist a maximal interval [0, N] such that x_{3} = ⌊M ∙ n / 2^{28}⌋ for any integer n ∈ [0, N]. Finishing part 1 of the RMSR requires evaluating x_{4} = x_{3} mod 16. However, by yet another practical aspect of CPUs, this step becomes unnecessary. Indeed, ⌊M ∙ n / 2^{28}⌋ is evaluated as (M * n) >> 28
in modulus 2^{32} arithmetic. Hence all but the 32 leftmost bits of M * n
are discarded. After the 28bits right shift only the 4 leftmost bits of the result might be nonzero. Hence, the subsequent mod 16 operation has no effect.
The consideration for n = 2^{27} above implies N < 2^{27}. In particular, this method does not work on the whole range of 32bits unsigned integers. [Warren13] tackles this issue at the same time as he deals with the evaluation of φ^{1}. (Part 2 of RMSR.) This is not straightforward and he covers it only for a handful of divisors. His approach uses bitwise tricks obtained, as he puts it, by ‘a lot of trial and error’ (which is not generic) and lookup tables (which does not scale well). Details can be seen in [Warren13].
The mshift algorithm
Luckily, mshift and mcomp seek to make remainder comparisons without knowing it. For this reason they do not need to evaluate φ^{1} and only take into consideration the fact that φ is strictly increasing.
We have seen that if r is the remainder of the division of n by 10, then ⌊16 ∙ (n / 10) mod 16 = φ(r). Since φ is strictly increasing, we have r = 0 if, and only if, φ(r) = φ(0) = 0, ie, ⌊16 ∙ (n / 10)⌋ mod 16 = 0. Therefore, the latter equality is equivalent to divisibility by 10. Finally, using the efficient evaluation of ⌊16 ∙ (n / 10)⌋ mod 16, we get that for n ∈ [0, N], n % 10 == 0
is equivalent to (M * n) >> 28 == 0
. The same applies to, say, r = 3: if n ∈ [0, N], then n % 10 == 3
if, and only if, (M * n) >> 28 == (M * 3) >> 28
. (This is mshift in Figure 1 and Listing 1.)
Summarising, let uint_t
be an unsigned integer type with modulus 2^{w} arithmetic. Given any integer d ∈ [2, 2^{w}[, let p be the smallest integer such that d ≤ 2^{p}. Set M = ⌈2^{w} / d⌉ and s = w  p. Then, there exists an integer N ∈ [0, 2^{w}[ such that for integers n ∈ [0, N], r ∈ [0, d  1] and any relational operator ⋚ (i.e., ⋚ is any of ==
, !=
, <
, <=
, >
or >=
), we have n % d
⋚ r
is equivalent to phi(n)
⋚ phi(r)
, where phi
is defined by
uint_t phi(uint_t n) { return (M * n) >> s; }
A lower bound for N is straightforward to calculate but its derivation is outside the scope of this article. (See [qmodular].) Most often, we have N < 2^{w}  1 and, consequently, mshift cannot be applied to the whole range of uint_t
. We shall see later how to deal with this limitation.
The mcomp algorithm
The underlying idea of mcomp is similar to mshift’s, the difference being a simple observation. The essential point of mshift is looking at a few bits of n / d after the binary point which are obtained through phi
. This function introduces a rounding error that grows with n and, eventually, gets large enough to trash the important bits. At this point phi
returns a wrong value and mshift breaks. Looking at more bits on the right of the binary point enables mcomp to carry on.
Consider the example d = 10 again. As in mshift, the first step of mcomp is evaluating the approximation ⌈2^{32} / 10⌉ ∙ n, of (2^{32} / 10) ∙ n, under modulus 2^{32} arithmetic. Table 1 shows both products, in binary, for a few values of n. For easy of presentation, it only shows the most and least significant bytes of the 32 bits on the left of the binary point (ellipses account for the other 16 bits). (The 4 bits that are important to mshift are emphasised.)


Table 1 
When n divides 10 the second column shows 0 since (2^{32} / 10) ∙ n is a multiple of 2^{32}. The third column only approximates the second one: it starts showing 0, for n = 0, and steadily diverges from this value as n increases. Indeed, from n to n + 10 the error goes up by 4 (or (100)_{2}). (This is due to ⌈2^{32} / 10⌉ ∙ 10  (2^{32} / 10) ∙ 10 = ⌈2^{32} / 10⌉ ∙ 10  2^{32} = 4.) Therefore, starting at 0, after 2^{26} steps the error accrues to 4 ∙ 2^{26} = 2^{28} for n = 10 ∙ 2^{26}. This is large enough to trash the 4 leftmost bits of M ∙ n, which become (0001)_{2}. Notice the latter is the bit pattern expected for remainder r = 1 and not for r = 0 and then, mshift breaks at this point.
However, looking at all bits of the result, (0001 0000 … 0000 0000)_{2}, we realise it is still far below (0001 1001 … 1001 1010)_{2} which is the result for n = 1, that is, M = ⌈2^{32} / 10⌉ ∙ 1 = ⌈2^{32} / 10⌉. Therefore, if that n is not large enough for M ∙ n to reach the barrier M, then it is still possible to detect that n is multiple of 10. In other words, n % 10 == 0
is equivalent to M * n < M
.
In general, for any remainder r ∈ [0, d  1], the result of M ∙ n
mod 2^{32 }starts, for n = r, with a small error with respect to (2^{32} / 10) ∙ r. The error steadily increases in steps of size 4 as n takes successive values with the same remainder. If n is not large enough for M ∙ n to reach the next barrier, M ∙ (r + 1), then it is still possible to detect that n has remainder r. This means that n % 10 == r
is equivalent to (M * r <= M * n) && (M * n < M * (r + 1))
. Well, this is not entirely true and some details must be considered.
For r = 9, we have M ∙ (r + 1) = M ∙ 10 = 2^{32} + 4 ≥ 2^{32}. Hence, regardless of n, the second operand of &&
above should be true
. Nevertheless, its evaluation can yield false
since M ∙ (r + 1) overflows under modulus 2^{32} arithmetic. To get the right result when r = 9, we should test only the left hand side of &&
above.
Regarding &&
, this operator creates a branch that can immensely degrade performance. To address this issue, notice that by subtracting M ∙ r from M ∙ r ≤ M ∙ n < M ∙ (r + 1) we get that these inequalities are equivalent to 0 ≤ M ∙ (n  r) < M. However, under modulus 2^{32} arithmetic, the subtraction overflows when n < r. Although the correct derivation in modulus 2^{32} arithmetic is not that straightforward, the previous result remains valid. Furthermore, 0 ≤ M ∙ (n  r) is always true and then, we can evaluate n % 10 == r
as M * (n  r) < M
, except when r = 9. In this case, the upper bound M
should be replaced by M  4
to account for the fact that M * 10
overflows by an excess of 4
.
There are outstanding issues to be addressed but we evasively refer to [qmodular] for details.
We now summarise the mcomp algorithm. Let uint_t
be an unsigned integer type with modulus 2w arithmetic. Given any integer d ∈ [2, 2^{w}[, set M = ⌈2w / d⌉ and m = M ∙ d  2^{w}. If m < M, then there exists an integer N ∈ [0, 2^{w}[ such that for all integers n ∈ [0, N] and r ∈ [0, d  1], we have:
n % d == r
is equivalent toM * (n  r) < M
, if r ≠ d  1;M * (n  r) < M – m
, if r = d  1;M * n >= M * r
, if r = d  1;
n % d < r
is equivalent toM * n < M * r
.
From these results we can derive similar equivalences for other relational operators.
The unbounded case
Like minverse, r ∈ [0, d  1] is a precondition for mshift and mcomp. To extend these algorithms to larger remainders, the same approach used for minverse can be applied. (See [Neri19].)
Increasing the range of applicability
Both mshift and mcomp have a range of applicability, (expressed by the hypothesis n ∈ [0, N]) which, most often, is not the whole domain of the unsigned integer types they work on. (In contrast, builtin and minverse do not have this limitation.)
An easy way of increasing these algorithm’s applicability is by promoting the type that they work on. Specifically, to evaluate a modular expression with uint32_t
operands, we can promoted n
, d
and r
to uint64_t
and use the algorithm for this type. This is what the suffix _promoted in Figure 1 refers to.
Although GCC provides 128bit unsigned integer types, on x86_64 platforms these types only have partial support from the CPU and must be synthesised by software. Therefore, the promotion strategy might be rather expensive for uint64_t
and using mshift or mcomp for this type of operand might not be the best option if performance and full range of applicability are simultaneously needed.
Performance analysis
As in the warm up, all measurements shown in this section concern the evaluation of modular expressions for 65,536 uniformly distributed unsigned 32bits dividends in the interval [0, 10^{6}]. Remainders can be either fixed at compile time or variable at runtime. Charts show divisors on the xaxis and time measurements, in nanoseconds, on the yaxis. Timings are adjusted to account for the time of array scanning.
For clarity we restrict divisors to [1, 50] which suffices to spot trends. (Results for divisors up to 1,000 are available in [qmodular].) In addition, we filter out divisors that are powers of two since the bitwise trick is undoubtedly the best algorithm for them. The timings were obtained with the help of Google Benchmark [Google] running on an AMD Ryzen 7 1800X EightCore Processor @ 3600Mhz; caches: L1 Data 32K (x8), L1 Instruction 64K (x8), L2 Unified 512K (x8), L3 Unified 8192K (x2).
Figure 2 concerns divisibility tests, that is, evaluation of n % d == 0
. As already seen in [Neri19], builtin is slower than minverse which zigzags as divisors changes from odd (faster) to even (slower) values. The new algorithms, mshift and mcomp, perform for all divisors as good as minverse does for odd divisors. In contrast, their promoted variants perform better than minverse for even divisors. This makes the minverse preferable for odd divisors since it does not have the limitation on the range of n
whereas either of the promoted variants is preferable for even divisors.
Figure 2 
Figure 3 covers n % d == r
where r
is variable and uniformly distributed in [0, d  1]. In this example, r < d
always holds true but the compiler does not know it and adds a precondition check before calling mshift, mcomp and their promoted variants. For clarity, we do not included minverse but it is worth remembering (see [Neri19]) that it is slower than the builtin algorithm except for the few divisors for which the latter spikes. More or less the same happens here for mshift and its promoted flavour. The good news is that both variations of mcomp are faster than the builtin algorithm.
Figure 3 
Finally, Figure 4 considers the expression n % d > 1
. (Recall that minverse cannot evaluate this expression.) The builtin algorithm performs worse than all others and the promoted variations perform worse than their regular counterparts.
Figure 4 
Related work
As we have seen, mshift and mcomp follow the idea of RMSR as presented in [Warren13]. Furthermore, a recent work [Lemire19] presents a slight variation of promoted mcomp for uint32_t
. However, [Lemire19] restricts its discussion and analysis to divisibility tests, that is, to expressions of the form n % d == 0
.
Conclusion
This article presents the mshift and mcomp algorithms, which make remainder comparisons without knowing its value. In contrast to minverse, seen in the previous instalment of this series, they allow the usage of any comparison operator. However, they have a limitation not shared by minverse: in general, their range of applicability is not as large as the range of the inputs’ type. To address this issue, this article presents the promoted variants which take uint32_t
inputs and make intermediate calculations on uint64_t
values. We have seen situations where these algorithms can perform better than minverse and the builtin algorithm currently implemented by major compilers.
Although we have worked around the limitation on inputs by promoting uint32_t
values to uint64_t
, a similar idea is not viable when the input is already of the latter type. More precisely, there is no efficient promoted variant when inputs are already of type uint64_t
. This is the greatest limitation of mshift and mcomp. In Part 3 of this series we shall see yet another algorithm that does not have this issue.
Acknowledgements
I am deeply thankful to Fabio Fernandes for the incredible advice he provided during the research phase of this project. I am equally grateful to the Overload team for helping improve the manuscript.
References
[Google] https://github.com/google/benchmark
[Lemire19] Daniel Lemire, Owen Kaser and Nathan Kurz, Faster Remainder by Direct Computation: Applications to Compilers and Software Libraries, Software: Practice and Experience 49 (6), 2019.
[Neri19] Cassio Neri, ‘Quick Modular Calculations (Part 1)’, Overload 154, pages 11–15, December 2019.
[qmodular] https://github.com/cassioneri/qmodular
[QuickBench] Quick C++ Benchmark: http://quickbench.com/u3y2EZt_F8eAtWmFimt0MrwfHS8
[Warren13] Henry S. Warren, Jr., Hacker’s Delight, Second Edition, Addison Wesley, 2013.
 Powered by quickbench.com. For readers who are C++ programmers and do not know this site, I strongly recommend checking it out. In addition, I politely ask all readers to consider contributing to the site to keep it running. (Disclaimer: apart from being a regular user and donor, I have no other affiliation with this site.)
 We are using bold fixedwidth font for code (as is usual in Overload), so 98 / 10 = 9.8 is maths and
98 / 10 == 9
is code.  YMMV, reported numbers were obtained by a single run in quickbench.com using GCC 8.2 with
O3
andstd=c++17
[QuickBench]. I do not know details about the platform it runs on, especially the processor
has a PhD in Applied Mathematics from Université de Paris Dauphine. He worked as a lecturer in Mathematics before moving to the financial industry.
Programming Topics + Overload Journal #155  February 2020
Browse in : 
All
> Topics
> Programming
All > Journals > Overload > o155 Any of these categories  All of these categories 