Overload Journal #146, August 2018

Title: (Re)Actor Allocation at 15 CPU Cycles

Author: Bob Schmidt

Date: Sat, 04 August 2018 18:37:42 +01:00

Summary: (Re)Actor serialisation requires an allocator. Sergey Ignatchenko, Dmytro Ivanchykhin and Marcos Bracco pare malloc/free down to 15 CPU cycles.


Disclaimer: as usual, the opinions within this article are those of ‘No Bugs’ Hare, and do not necessarily coincide with the opinions of the translators and Overload editors; also, please keep in mind that translation difficulties from Lapine (like those described in [Loganberry04]) might have prevented an exact translation. In addition, the translator and Overload expressly disclaim all responsibility from any action or inaction resulting from reading this article.

Task definition

Some time ago, in our (Re)Actor-based project, we found ourselves with a need to serialize the state of our (Re)Actor. We eventually found that app-level serialization (such as described in [Ignatchenko16]) is cumbersome to implement, so we decided to explore the possibility of serializing a (Re)Actor state at allocator level. In other words, we would like to have all the data of our (Re)Actor residing within a well-known set of CPU/OS pages, and then we’d be able to serialize it page by page (it doesn’t require app-level support, and is Damn Fast™; dealing with ASLR when deserializing at page level is a different story, which we hope to discuss at some point later).

However, to serialize the state of our (Re)Actor at allocator level, we basically had to write our own allocator. The main requirements for such an allocator were that all the data of our (Re)Actor reside within a well-known set of pages (so that the state can be serialized page by page), and that all access to the allocator be single-threaded (which comes for free within a (Re)Actor).

Actually, it was when we realized that we only needed to consider single-threaded code that we thought, ‘Hey! This can be a good way to improve performance compared to industry-leading generic allocators.’ Admittedly, it took more effort than we expected, but we have finally achieved results which we think are interesting enough to share.

What our allocator is NOT

By the very definition of our task, our allocator does not aim to be a drop-in replacement for existing allocators (at least, not for all programs). Use of our allocator is restricted to those environments where all accesses to a certain allocator are guaranteed to be single-threaded; two prominent examples of such scenarios are message-passing architectures (such as Erlang) and (Re)Actors a.k.a. Actors a.k.a. Reactors a.k.a. ad hoc FSMs a.k.a. Event-Driven Programs.

In other words, we did not really manage to outperform the mallocs we refer to below; what we managed to do was to find a (very practical and very important) subset of use cases (specifically message passing and (Re)Actors), and write a highly optimized allocator specifically for them. That being said, while writing it, we did use a few interesting tricks (discussed below), so some of our ideas might be usable for regular allocators too.

On the other hand, as soon as you’re within a (Re)Actor, our allocator does not require additional programming effort at the app-level; this gives it an advantage over manually managed allocation strategies such as those used by Boost::Pool (not forgetting that, if necessary, you can still use Boost::Pool on top of our allocator).

Major design decisions

When we started development of our allocator (which we named iibmalloc, available at [Github]), we needed to make a few significant decisions.

First, we needed to decide how to achieve multiple allocators per process, preferably without specifying an allocator at app-level explicitly. We decided to handle it via TLS (thread_local in modern C++): very briefly, each thread gets its own allocator instance, reachable through a thread_local, so our malloc()/free() entry points always find the current allocator without app-level code ever naming it.
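A minimal sketch of this TLS arrangement (the names below are ours, not iibmalloc’s actual API, and std::malloc stands in for the real allocation logic):

    #include <cstddef>
    #include <cstdlib>

    // Stand-in for the real per-thread allocator state.
    struct PerThreadAllocator {
        void* allocate(std::size_t sz) { return std::malloc(sz); } // placeholder body
        void deallocate(void* p) { std::free(p); }                 // placeholder body
    };

    static thread_local PerThreadAllocator g_allocator; // one allocator per thread

    void* iib_malloc(std::size_t sz) { return g_allocator.allocate(sz); }
    void  iib_free(void* p)          { g_allocator.deallocate(p); }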

Second, we needed to decide whether we wanted to spend time keeping track of whole pages becoming empty so that we could release them. Based on the logic discussed in [NoBugs18], we decided in favor of not spending any effort on keeping track of allocation items as long as there is more than one such item per CPU page.

Third, in particular based on [NoBugs16], we aimed to use as few memory accesses as humanly possible. Indeed, on modern CPUs, register–register operations (which take ~1 CPU cycle) are pretty much free compared to memory accesses (which can go up to 100+ CPU cycles).

Implementation

We decided to split all our allocations into four groups depending on their size: ‘small’, ‘medium’, ‘large’, and ‘very large’.
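Purely for illustration, such a classification might look as follows (the thresholds below are hypothetical placeholders; the real boundaries are implementation details of iibmalloc):

    #include <cstddef>

    enum class SizeClass { Small, Medium, Large, VeryLarge };

    inline SizeClass classify(std::size_t sz) {
        if (sz <= 64)         return SizeClass::Small;     // hypothetical threshold
        if (sz <= 1024)       return SizeClass::Medium;    // hypothetical threshold
        if (sz <= (1u << 20)) return SizeClass::Large;     // hypothetical threshold
        return SizeClass::VeryLarge;
    }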

Optimizing allocation – calculating logarithms

Up to now, everything has been fairly obvious. Now, we can get to the interesting part: specifically, what did we do to optimize our allocator? First, let’s note that we spent most of our time optimizing ‘small’ and ‘medium’ allocations (on the basis that they’re by far the most popular allocs in most apps, especially in (Re)Actor apps).

The first problem we faced when trying to optimize small/medium allocations was that – given the allocation size, which comes in a call to our malloc() – we need to calculate the bucket number. As our bucket sizes are exponents, this means that effectively we had to calculate an (integer) logarithm of the allocation size.

If we have bucket sizes of 8, 16, 32, 64, … – then calculating the integer logarithm (more strictly, finding the greatest integer such that two raised to that integer is less than or equal to the allocation size) becomes a cinch. For example, on x64 we can/should use a BSR instruction, which is extremely fast. (How to ensure that our code generates a BSR is a different, compiler-dependent story, but it can be done for all major compilers.) Once we have our BSR, skipping some minor details, we can calculate bucket_number = BSR(size-1)-2, or, in terms of bitwise arithmetic, the ordinal number of the greatest bit set of (size-1), minus two.
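In GCC/Clang terms, this is a one-liner (a sketch: __builtin_clzll compiles down to BSR/LZCNT on x64, and MSVC users would reach for _BitScanReverse64 instead):

    #include <cstddef>

    // bucket_number = BSR(size-1) - 2, for bucket sizes 8, 16, 32, 64, ...
    // Valid for sz >= 5; real code clamps smaller sizes into bucket 0.
    inline std::size_t bucketNumberExp(std::size_t sz) {
        std::size_t msb = 63 - __builtin_clzll(sz - 1); // ordinal of the greatest bit set
        return msb - 2;
    }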

However, having bucket sizes double at each step leads to significant overheads, so we decided to go for a ‘half-exponent’ sequence of 8, [12 omitted due to alignment requirements], 16, 24, 32, 48, 64, … In this case, the required logarithm to find our bucket size can still be calculated very quickly along quite similar lines: it is twice the ordinal number of the greatest bit set of (size-1), plus the value of the second-greatest bit of (size-1), minus five.
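In the same GCC/Clang terms as above, this reads as follows (our rendering of the formula; note that the omitted 12-byte bucket still occupies an index in this numbering, and that the formula is valid for sz >= 7, with smaller sizes clamped into bucket 0 in real code):

    #include <cstddef>

    // bucket index for sizes 8, [12], 16, 24, 32, 48, 64, ...
    inline std::size_t bucketNumberHalfExp(std::size_t sz) {
        std::size_t x = sz - 1;
        std::size_t msb = 63 - __builtin_clzll(x);  // ordinal of the greatest bit set
        std::size_t next = (x >> (msb - 1)) & 1;    // value of the second-greatest bit
        return 2 * msb + next - 5;
    }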

These are still register-only operations, are still branch-free, and are still extremely fast. In fact, when we switched to ‘half-exponent’ buckets, we found that – due to improved locality – the measured speed improved in spite of the extra calculations added.

Optimizing deallocation – placing information in a dereferenceable pointer?!

The key, the whole key, and nothing but the key, so help me Codd. ~ unknown

Up to now, we have described nothing particularly interesting. It was when we faced the problem of how to do deallocation efficiently that we got into the really interesting stuff.

Whenever we get a free() call, all we have is a pointer and nothing but a pointer (and the same applies to C++ delete). And from this single pointer we need to find out: (a) whether it points to a ‘small’, ‘medium’, ‘large’, or ‘very large’ allocated block; and, for small/medium blocks, (b) which of the buckets it belongs to.

Take 1 – Header for each allocation item

The most obvious (and time-tested) way of handling it is to have an allocated-item header preceding each allocation item, which contains all the necessary information. This works, but requires 2 memory reads (cached ones, but still taking 3 cycles or so each) and, even more importantly, the item header cannot be less than 8 bytes (due to alignment requirements), which means up to twice the overhead for smaller allocation sizes (which also happen to be the most popular ones).
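A sketch of this classic header-based scheme (the field layout is ours, for illustration only):

    #include <cstdint>

    struct ItemHeader {             // cannot be smaller than 8 bytes due to alignment
        std::uint32_t bucket_number;
        std::uint32_t flags;        // illustrative field
    };

    inline ItemHeader* headerOf(void* p) {
        return reinterpret_cast<ItemHeader*>(p) - 1; // step back from the user pointer
    }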

We tried this one; it did work, but we were sure it was possible to do better.

Take 2 – Dereferenceable pointers and bucket page headers

For our next step, we had two thoughts: first, that an 8-byte per-item header is overkill when all the items within a bucket page share the same bucket number, so this number can be stored just once, in a per-page header; and second, that if bucket pages are suitably aligned, this page header can be reached from any item pointer simply by masking out the low bits of the pointer, as sketched below.
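A minimal sketch of this Take-2 lookup, assuming 4K bucket pages aligned on a 4K boundary:

    #include <cstdint>

    struct PageHeader { std::uint8_t bucket_number; /* ... */ };

    // Mask off the low 12 bits of the item pointer to reach the page header.
    inline PageHeader* pageHeaderOf(void* p) {
        return reinterpret_cast<PageHeader*>(
            reinterpret_cast<std::uintptr_t>(p) & ~std::uintptr_t(0xFFF));
    }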

This approach worked, and it did reduce memory overhead; however, the cost of the indirection to the page_header (which is much less likely to be cached than a Take-1 item header sitting right next to the item) was significant, so we observed a minor performance degradation ☹.

Take 3 – Storing the bucket number within a dereferenceable pointer

However, (fortunately) we didn’t give up, and came up with the following scheme, which effectively allows us to extract the bucket number from each small/medium allocated pointer. It requires a bit of explanation.

Whenever we’re allocating a bunch of pages from the OS (via mmap()/VirtualAllocEx()), we can do it in the following manner: we reserve a 64K-aligned block consisting of 16 contiguous 4K pages, and dedicate the i-th 4K page within this block to serving allocations from bucket number i.

After we’re done with this, we can say that:

For each and every ‘small’/‘medium’ pointer to be freed, the expression ((pointer_to_be_freed>>12)&0xF) gives us the bucket number.

This information can be extracted purely from the pointer, without any indirections(!). In other words, by doing some magic we did manage to put information about the bucket number into the pointer itself(!!).
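In code, the free()-side extraction is just this (a sketch of the expression above):

    #include <cstdint>

    // Bits 12..15 of the address select the 4K page within its 64K-aligned block,
    // and, per the placement invariant above, that page index IS the bucket number.
    inline unsigned bucketFromPointer(void* p) {
        return static_cast<unsigned>((reinterpret_cast<std::uintptr_t>(p) >> 12) & 0xF);
    }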

In practice, it was a bit more complicated than that (to avoid creating too many VMAs, we needed to reserve/commit pages in larger chunks – such as 8M), but the principles stated above still stand in our implementation.

This approach happened to be the best one both performance-wise and memory-overhead-wise.

How our deallocation works

To put all the pieces of our deallocation together, let’s see how our deallocation routine works: given the pointer to be freed, we extract the bucket number from it with one shift and one mask (as described above), and return the item to the corresponding bucket; a sketch follows.
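A sketch of this fast path (the per-bucket free list is our assumption about the bookkeeping, and detection of ‘large’/‘very large’ blocks is omitted here):

    #include <cstdint>

    static thread_local void* g_freeListHead[16]; // one free list per bucket (our assumption)

    void freeSmallOrMedium(void* p) {
        unsigned bucket = (reinterpret_cast<std::uintptr_t>(p) >> 12) & 0xF; // no memory indirection
        *static_cast<void**>(p) = g_freeListHead[bucket]; // link the item into the list
        g_freeListHead[bucket] = p;                       // it becomes the new list head
    }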

This is the most time-critical path, and we got it down to a very few operations (maybe even close to the least possible). Bingo!

Test results

Of course, all the theorizing about having very few memory accesses is fine and dandy, but to map it onto the real world, we have to run some benchmarks. So, after all the optimizations (those above and others, such as forcing the most critical path – and only the most critical path – to be inlined), we ran our own ‘simulating real-world loads’ test [Ignatchenko18] and compared our iibmalloc with general-purpose (multithreaded) allocators. We feel that the results we observed for our iibmalloc were well worth the trouble we took while developing it.

The testing is described in detail in [Ignatchenko18]; here we only summarize the results.

As we can see (Figure 1), CPU-wise, we were able to outperform all the other allocators by at least 1.5x.

Figure 1

And from the point of view of memory overhead (Figure 2), our iibmalloc has also performed well: its overhead was pretty much on par with the best alloc we have seen overhead-wise (jemalloc) – while significantly outperforming it CPU-wise.

Figure 2

Note that the comparison of iibmalloc with the other allocators is not a 100% ‘fair’ one: to get these performance gains, we had to give up support for multi-threading. However, whenever you can afford to keep the Shared-Nothing model (=‘sharing by communicating instead of communicating by sharing memory’), this allocator is likely to improve the performance of malloc-heavy apps.

Another interesting observation can be seen in the graph in Figure 3, which shows the results of a different set of tests, varying the size of the allocated memory.

Figure 3

NB: Figure 3 is for a single thread, which, as we have seen above, is the very best case for tcmalloc; for larger numbers of threads, tcmalloc will start to lose ground.

On the graph, we can see that when we’re restricting our allocated data set to single-digit-megabytes (so everything is L3-cached and significant parts are L2-cached), then the combined costs of a malloc()/free() pair for our iibmalloc can be as little as 15 CPU clock cycles(!). For a malloc()/free() pair, 15 CPU cycles is a pretty good result, which we expect to be quite challenging to beat (though obviously we’ll be happy if somebody does). On the other hand, as we have spent only a few man-months on our allocator, there is likely quite a bit of room for further improvements.

Conclusions

We presented an allocator which exhibits significant performance gains by giving up multi-threading. We did not really try to compete with other allocators (we’re solving a different task, so it is like comparing apples and oranges); however, we feel that we can confidently say that

For (Re)Actors and message-passing programs in general, it is possible to have a significantly better-performing allocator than a generic multi-threaded one.

As a potentially nice side effect, we also demonstrated a few (hopefully novel – at least we haven’t run into them before) techniques, such as storing information in dereferenceable pointers, and these techniques might (or might not) happen to be useful for writers of generic allocators too.

References

[Github] https://github.com/node-dot-cpp/iibmalloc

[Ignatchenko16] Sergey Ignatchenko and Dmytro Ivanchykhin (2016) ‘Ultra-fast Serialization of C++ Objects’, Overload #136

[Ignatchenko18] Sergey Ignatchenko, Dmytro Ivanchykhin, and Maxim Blashchuk (2018) ‘Testing Memory Allocators: ptmalloc2 vs tcmalloc vs hoard vs jemalloc While Trying to Simulate Real-World Loads’, http://ithare.com/testing-memory-allocators-ptmalloc2-tcmalloc-hoard-jemalloc-while-trying-to-simulate-real-world-loads/

[Loganberry04] David ‘Loganberry’ (2004) Frithaes! – an Introduction to Colloquial Lapine!, http://bitsnbobstones.watershipdown.org/lapine/overview.html

[NoBugs16] ‘No Bugs’ Hare (2016) ‘Operation Costs in CPU Clock Cycles’, http://ithare.com/infographics-operation-costs-in-cpu-clock-cycles/

[NoBugs18] ‘No Bugs’ Hare (2018) ‘The Curse of External Fragmentation: Relocate or Bust!’, http://ithare.com/the-curse-of-external-fragmentation-relocate-or-bust/
