Overload Journal #146, August 2018

Title: (Re)Actor Allocation at 15 CPU Cycles

Author: Bob Schmidt

Date: Sat, 04 August 2018 18:37:42 +01:00

Summary: (Re)Actor serialisation requires an allocator. Sergey Ignatchenko, Dmytro Ivanchykhin and Marcos Bracco pare malloc/free down to 15 CPU cycles.


Disclaimer: as usual, the opinions within this article are those of ‘No Bugs’ Hare, and do not necessarily coincide with the opinions of the translators and Overload editors; also, please keep in mind that translation difficulties from Lapine (like those described in [Loganberry04]) might have prevented an exact translation. In addition, the translator and Overload expressly disclaim all responsibility from any action or inaction resulting from reading this article.

Task definition

Some time ago, in our (Re)Actor-based project, we found ourselves with a need to serialize the state of our (Re)Actor. We eventually found that app-level serialization (such as described in [Ignatchenko16]) is cumbersome to implement, so we decided to explore the possibility of serializing a (Re)Actor state at allocator level. In other words, we would like to have all the data of our (Re)Actor residing within a well-known set of CPU/OS pages, and then we’d be able to serialize it page by page (it doesn’t require app-level support, and is Damn Fast™; dealing with ASLR when deserializing at page level is a different story, which we hope to discuss at some point later).

However, to serialize the state of our (Re)Actor at allocator level, we basically had to write our own allocator. The main requirements for such an allocator were that all the data of our (Re)Actor reside within a well-known set of pages (so that the state can be serialized page by page), and that all access to the allocator be single-threaded (which comes for free within a (Re)Actor).

Actually, it was when we realized that we only needed to consider single-threaded code that we thought, ‘Hey! This can be a good way to improve performance compared to industry-leading generic allocators.’ Admittedly, it took more effort than we expected, but we have finally achieved results which we think are interesting enough to share.

What our allocator is NOT

By the very definition of our task, our allocator does not aim to be a drop-in replacement for existing allocators (at least, not for all programs). Use of our allocator is restricted to those environments where all accesses to a certain allocator are guaranteed to be single-threaded; two prominent examples of such scenarios are message-passing architectures (such as Erlang) and (Re)Actors a.k.a. Actors a.k.a. Reactors a.k.a. ad hoc FSMs a.k.a. Event-Driven Programs.

In other words, we did not really manage to outperform the mallocs we refer to below; what we managed to do was to find a (very practical and very important) subset of use cases (specifically message passing and (Re)Actors), and write a highly optimized allocator specifically for them. That being said, while writing it, we did use a few interesting tricks (discussed below), so some of our ideas might be usable for regular allocators too.

On the other hand, as soon as you’re within a (Re)Actor, our allocator does not require additional programming effort at the app-level; this gives it an advantage over manually managed allocation strategies such as those used by Boost::Pool (not forgetting that, if necessary, you can still use Boost::Pool on top of our allocator).

Major design decisions

When we started development of our allocator (which we named iibmalloc, available at [Github]), we needed to make a few significant decisions.

First, we needed to decide how to achieve multiple allocators per process, preferably without specifying an allocator at app-level explicitly. We decided to handle it via TLS (thread_local in modern C++): very briefly, each thread gets its own allocator instance, reachable through a thread_local, so our malloc()/free() entry points always find the current allocator without app-level code ever naming it.
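A minimal sketch of this TLS arrangement (the names below are ours, not iibmalloc’s actual API, and std::malloc stands in for the real allocation logic):

    #include <cstddef>
    #include <cstdlib>

    // Stand-in for the real per-thread allocator state.
    struct PerThreadAllocator {
        void* allocate(std::size_t sz) { return std::malloc(sz); } // placeholder body
        void deallocate(void* p) { std::free(p); }                 // placeholder body
    };

    static thread_local PerThreadAllocator g_allocator; // one allocator per thread

    void* iib_malloc(std::size_t sz) { return g_allocator.allocate(sz); }
    void  iib_free(void* p)          { g_allocator.deallocate(p); }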

Second, we needed to decide whether we wanted to spend time keeping track of whole pages becoming empty so that we could release them. Based on the logic discussed in [NoBugs18], we decided in favor of not spending any effort on keeping track of allocation items as long as there is more than one such item per CPU page.

Third, in particular based on [NoBugs16], we aimed to use as few memory accesses as humanly possible. Indeed, on modern CPUs, register–register operations (which take ~1 CPU cycle) are pretty much free compared to memory accesses (which can go up to 100+ CPU cycles).

Implementation

We decided to split all our allocations into four groups depending on their size: ‘small’, ‘medium’, ‘large’, and ‘very large’.
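Purely for illustration, such a classification might look as follows (the thresholds below are hypothetical placeholders; the real boundaries are implementation details of iibmalloc):

    #include <cstddef>

    enum class SizeClass { Small, Medium, Large, VeryLarge };

    inline SizeClass classify(std::size_t sz) {
        if (sz <= 64)         return SizeClass::Small;     // hypothetical threshold
        if (sz <= 1024)       return SizeClass::Medium;    // hypothetical threshold
        if (sz <= (1u << 20)) return SizeClass::Large;     // hypothetical threshold
        return SizeClass::VeryLarge;
    }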

Optimizing allocation – calculating logarithms

Up to now, everything has been fairly obvious. Now, we can get to the interesting part: specifically, what did we do to optimize our allocator? First, let’s note that we spent most of our time optimizing ‘small’ and ‘medium’ allocations (on the basis that they’re by far the most popular allocs in most apps, especially in (Re)Actor apps).

The first problem we faced when trying to optimize small/medium allocations was that – given the allocation size, which comes in a call to our malloc() – we need to calculate the bucket number. As our bucket sizes are exponents, this means that effectively we had to calculate an (integer) logarithm of the allocation size.

If we have bucket sizes of 8, 16, 32, 64, … – then calculating the integer logarithm (more strictly, finding the greatest integer such that two raised to that integer is less than or equal to the allocation size) becomes a cinch. For example, on x64 we can/should use a BSR instruction, which is extremely fast. (How to ensure that our code generates a BSR is a different, compiler-dependent story, but it can be done for all major compilers.) Once we have our BSR, skipping some minor details, we can calculate bucket_number = BSR(size-1)-2, or, in terms of bitwise arithmetic, the ordinal number of the greatest bit set of (size-1), minus two.
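In GCC/Clang terms, this is a one-liner (a sketch: __builtin_clzll compiles down to BSR/LZCNT on x64, and MSVC users would reach for _BitScanReverse64 instead):

    #include <cstddef>

    // bucket_number = BSR(size-1) - 2, for bucket sizes 8, 16, 32, 64, ...
    // Valid for sz >= 5; real code clamps smaller sizes into bucket 0.
    inline std::size_t bucketNumberExp(std::size_t sz) {
        std::size_t msb = 63 - __builtin_clzll(sz - 1); // ordinal of the greatest bit set
        return msb - 2;
    }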

However, having bucket sizes double at each step leads to significant overheads, so we decided to go for a ‘half-exponent’ sequence of 8, [12 omitted due to alignment requirements], 16, 24, 32, 48, 64, … In this case, the required logarithm to find our bucket size can still be calculated very quickly along quite similar lines: it is twice the ordinal number of the greatest bit set of (size-1), plus the value of the second-greatest bit of (size-1), minus five.
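In the same GCC/Clang terms as above, this reads as follows (our rendering of the formula; note that the omitted 12-byte bucket still occupies an index in this numbering, and that the formula is valid for sz >= 7, with smaller sizes clamped into bucket 0 in real code):

    #include <cstddef>

    // bucket index for sizes 8, [12], 16, 24, 32, 48, 64, ...
    inline std::size_t bucketNumberHalfExp(std::size_t sz) {
        std::size_t x = sz - 1;
        std::size_t msb = 63 - __builtin_clzll(x);  // ordinal of the greatest bit set
        std::size_t next = (x >> (msb - 1)) & 1;    // value of the second-greatest bit
        return 2 * msb + next - 5;
    }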

These are still register-only operations, are still branch-free, and are still extremely fast. In fact, when we switched to ‘half-exponent’ buckets, we found that – due to improved locality – the measured speed improved in spite of the extra calculations added.

Optimizing deallocation – placing information in a dereferenceable pointer?!

The key, the whole key, and nothing but the key, so help me Codd. ~ unknown

Up to now, we have described nothing particularly interesting. It was when we faced the problem of how to do deallocation efficiently that we got into the really interesting stuff.

Whenever we get a free() call, all we have is a pointer and nothing but a pointer (and the same applies to C++ delete). And from this single pointer we need to find out: (a) whether it points to a ‘small’, ‘medium’, ‘large’, or ‘very large’ allocated block; and, for small/medium blocks, (b) which of the buckets it belongs to.

Take 1 – Header for each allocation item

The most obvious (and time-tested) way of handling it is to have an allocated-item header preceding each allocation item, which contains all the necessary information. This works, but requires 2 memory reads (cached ones, but still taking 3 cycles or so each) and, even more importantly, the item header cannot be less than 8 bytes (due to alignment requirements), which means up to twice the overhead for smaller allocation sizes (which also happen to be the most popular ones).
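A sketch of this classic header-based scheme (the field layout is ours, for illustration only):

    #include <cstdint>

    struct ItemHeader {             // cannot be smaller than 8 bytes due to alignment
        std::uint32_t bucket_number;
        std::uint32_t flags;        // illustrative field
    };

    inline ItemHeader* headerOf(void* p) {
        return reinterpret_cast<ItemHeader*>(p) - 1; // step back from the user pointer
    }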

We tried this one; it did work, but we were sure it was possible to do better.

Take 2 – Dereferenceable pointers and bucket page headers

For our next step, we had two thoughts: first, that an 8-byte per-item header is overkill when all the items within a bucket page share the same bucket number, so this number can be stored just once, in a per-page header; and second, that if bucket pages are suitably aligned, this page header can be reached from any item pointer simply by masking out the low bits of the pointer, as sketched below.
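A minimal sketch of this Take-2 lookup, assuming 4K bucket pages aligned on a 4K boundary:

    #include <cstdint>

    struct PageHeader { std::uint8_t bucket_number; /* ... */ };

    // Mask off the low 12 bits of the item pointer to reach the page header.
    inline PageHeader* pageHeaderOf(void* p) {
        return reinterpret_cast<PageHeader*>(
            reinterpret_cast<std::uintptr_t>(p) & ~std::uintptr_t(0xFFF));
    }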

This approach worked, and it did reduce memory overhead; however, the cost of the indirection to the page_header (which is much less likely to be cached than a Take-1 item header sitting right next to the item) was significant, so we observed a minor performance degradation ☹.

Take 3 – Storing the bucket number within a dereferenceable pointer

However, (fortunately) we didn’t give up, and came up with the following scheme, which effectively allows us to extract the bucket number from each small/medium allocated pointer. It requires a bit of explanation.

Whenever we’re allocating a bunch of pages from the OS (via mmap()/VirtualAllocEx()), we can do it in the following manner: we reserve a 64K-aligned block consisting of 16 contiguous 4K pages, and dedicate the i-th 4K page within this block to serving allocations from bucket number i.

After we’re done with this, we can say that:

For each and every ‘small’/‘medium’ pointer to be freed, the expression ((pointer_to_be_freed>>12)&0xF) gives us the bucket number.

This information can be extracted purely from the pointer, without any indirections(!). In other words, by doing some magic we did manage to put information about the bucket number into the pointer itself(!!).
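In code, the free()-side extraction is just this (a sketch of the expression above):

    #include <cstdint>

    // Bits 12..15 of the address select the 4K page within its 64K-aligned block,
    // and, per the placement invariant above, that page index IS the bucket number.
    inline unsigned bucketFromPointer(void* p) {
        return static_cast<unsigned>((reinterpret_cast<std::uintptr_t>(p) >> 12) & 0xF);
    }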

In practice, it was a bit more complicated than that (to avoid creating too many VMAs, we needed to reserve/commit pages in larger chunks – such as 8M), but the principles stated above still stand in our implementation.

This approach happened to be the best one both performance-wise and memory-overhead-wise.

How our deallocation works

To put all the pieces of our deallocation together, let’s see how our deallocation routine works: given the pointer to be freed, we extract the bucket number from it with one shift and one mask (as described above), and return the item to the corresponding bucket; a sketch follows.
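A sketch of this fast path (the per-bucket free list is our assumption about the bookkeeping, and detection of ‘large’/‘very large’ blocks is omitted here):

    #include <cstdint>

    static thread_local void* g_freeListHead[16]; // one free list per bucket (our assumption)

    void freeSmallOrMedium(void* p) {
        unsigned bucket = (reinterpret_cast<std::uintptr_t>(p) >> 12) & 0xF; // no memory indirection
        *static_cast<void**>(p) = g_freeListHead[bucket]; // link the item into the list
        g_freeListHead[bucket] = p;                       // it becomes the new list head
    }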

This is the most time-critical path, and we got it down to a very few operations (maybe even close to the least possible). Bingo!

Test results

Of course, all the theorizing about having very few memory accesses is fine and dandy, but to map it onto the real world, we have to run some benchmarks. So, after all the optimizations (those above and others, such as forcing the most critical path – and only the most critical path – to be inlined), we ran our own ‘simulating real-world loads’ test [Ignatchenko18] and compared our iibmalloc with general-purpose (multithreaded) allocators. We feel that the results we observed for our iibmalloc were well worth the trouble we took while developing it.

The testing is described in detail in [Ignatchenko18]; here we only summarize the results.

As we can see (Figure 1), CPU-wise, we were able to outperform all the other allocators by at least 1.5x.

Figure 1

And from the point of view of memory overhead (Figure 2), our iibmalloc has also performed well: its overhead was pretty much on par with the best alloc we have seen overhead-wise (jemalloc) – while significantly outperforming it CPU-wise.

Figure 2

Note that the comparison of iibmalloc with the other allocators is not a 100% ‘fair’ one: to get these performance gains, we had to give up support for multi-threading. However, whenever you can afford to keep the Shared-Nothing model (=‘sharing by communicating instead of communicating by sharing memory’), this allocator is likely to improve the performance of malloc-heavy apps.

Another interesting observation can be seen in the graph in Figure 3, which shows the results of a different set of tests, varying the size of the allocated memory.

Figure 3

NB: Figure 3 is for a single thread, which, as we have seen above, is the very best case for tcmalloc; for larger numbers of threads, tcmalloc will start to lose ground.

On the graph, we can see that when we’re restricting our allocated data set to single-digit-megabytes (so everything is L3-cached and significant parts are L2-cached), then the combined costs of a malloc()/free() pair for our iibmalloc can be as little as 15 CPU clock cycles(!). For a malloc()/free() pair, 15 CPU cycles is a pretty good result, which we expect to be quite challenging to beat (though obviously we’ll be happy if somebody does). On the other hand, as we have spent only a few man-months on our allocator, there is likely quite a bit of room for further improvements.

Conclusions

We presented an allocator which exhibits significant performance gains by giving up multi-threading. We did not really try to compete with other allocators (we’re solving a different task, so it is like comparing apples and oranges); however, we feel that we can confidently say that

For (Re)Actors and message-passing programs in general, it is possible to have a significantly better-performing allocator than a generic multi-threaded one.

As a potentially nice side effect, we also demonstrated a few (hopefully novel – at least we haven’t run into them before) techniques, such as storing information in dereferenceable pointers, and these techniques might (or might not) happen to be useful for writers of generic allocators too.

References

[Github] https://github.com/node-dot-cpp/iibmalloc

[Ignatchenko16] Sergey Ignatchenko and Dmytro Ivanchykhin (2016) ‘Ultra-fast Serialization of C++ Objects’, Overload #136

[Ignatchenko18] Sergey Ignatchenko, Dmytro Ivanchykhin, and Maxim Blashchuk (2018) ‘Testing Memory Allocators: ptmalloc2 vs tcmalloc vs hoard vs jemalloc While Trying to Simulate Real-World Loads’, http://ithare.com/testing-memory-allocators-ptmalloc2-tcmalloc-hoard-jemalloc-while-trying-to-simulate-real-world-loads/

[Loganberry04] David ‘Loganberry’ (2004) Frithaes! – an Introduction to Colloquial Lapine!, http://bitsnbobstones.watershipdown.org/lapine/overview.html

[NoBugs16] ‘No Bugs’ Hare (2016) ‘Operation Costs in CPU Clock Cycles’, http://ithare.com/infographics-operation-costs-in-cpu-clock-cycles/

[NoBugs18] ‘No Bugs’ Hare (2018) ‘The Curse of External Fragmentation: Relocate or Bust!’, http://ithare.com/the-curse-of-external-fragmentation-relocate-or-bust/
