Overload Journal #141 - October 2017

Title: ‘Speedy Gonzales’ Serializing (Re)Actors via Allocators

Author: Bob Schmidt

Date: Thu, 05 October 2017 19:28:12 +01:00

Summary: More speed! Sergey Ignatchenko completes his (Re)Actor allocator series with Part III.

Body: 

Start the reactor. Free Mars…
~ Kuato from Total Recall

Disclaimer: as usual, the opinions within this article are those of ‘No Bugs’ Hare, and do not necessarily coincide with the opinions of the translators and Overload editors; also, please keep in mind that translation difficulties from Lapine (like those described in [Loganberry04]) might have prevented an exact translation. In addition, the translator and Overload expressly disclaim all responsibility from any action or inaction resulting from reading this article.

As we briefly discussed in Part I of this mini-series [NoBugs17a], message-passing technologies such as (Re)Actors (a.k.a. Actors, Reactors, ad hoc FSMs, and event-driven programs) have numerous advantages – ranging from being debuggable (including post-factum production debugging) to providing better overall performance.

In [NoBugs17a] and [NoBugs17b], we discussed an approach to handling allocations for (Re)Actors, which allowed us to achieve a safe dialect of C++ (safe, that is, as long as we follow a set of well-defined local rules). Now, let’s take a look at another task which can be facilitated by per-(Re)Actor allocators: the task of serializing (Re)Actors that are later going to be de-serialized by the same executable. While one solution for this task was provided in [Ignatchenko-Ivanchykhin16], the ‘ultra-fast’ serialization proposed there is rather cumbersome to maintain, and in most cases it can be beaten performance-wise by serializing at the allocator level.

#define (Re)Actors

To make this article self-contained and make sure that we’re all on the same page with terminology, let’s repeat the definition of our (Re)Actors from [NoBugs17a].

Let’s name a common denominator for all our (Re)Actors a GenericReactor. GenericReactor is just an abstract class – and has a pure virtual function, react():

  class GenericReactor {
  public:
    virtual void react(const Event& ev) = 0;
    virtual ~GenericReactor() {}
  };

Let’s name the piece of code which calls GenericReactor’s react() Infrastructure Code. Quite often this call will be within a so-called ‘event loop’ (see Listing 1).

std::unique_ptr<GenericReactor> r 
  = reactorFactory.createReactor(...);

while(true) { //event loop
  Event ev = get_event();
    //from select(), libuv, ...
  r->react(ev);
}

Listing 1

Let’s note that the get_event() function can obtain events from wherever we want: from select() (which is quite common for servers) to libraries such as libuv (which is common for clients).

Also let’s note that an event loop such as the one above is by far not the only way to call react(): I’ve seen implementations of Infrastructure Code ranging from one running multiple (Re)Actors within the same thread, to one which deserialized the (Re)Actor from a database (DB), called react(), and then serialized the (Re)Actor back to the DB. What’s important, though, is that even if react() can be called from different threads, it must be called as if it were called from one single thread. In other words, if necessary, all thread sync should be done outside of our (Re)Actor, so react() doesn’t need to bother about thread sync regardless of the Infrastructure Code in use.
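
For illustration purposes only, such outside-the-(Re)Actor synchronization might look along the following lines (the ReactorRunner wrapper is purely illustrative and not part of the (Re)Actor machinery discussed in this mini-series):

  #include <mutex>

  // Purely illustrative: the sync object lives in Infrastructure Code,
  // outside of the (Re)Actor, so react() itself stays sync-free.
  class ReactorRunner {
    GenericReactor& r;
    std::mutex mx;
  public:
    explicit ReactorRunner(GenericReactor& r_) : r(r_) {}
    void dispatch(const Event& ev) {
      std::lock_guard<std::mutex> lock(mx); // at most one react() at any time
      r.react(ev);
    }
  };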

Finally, let’s name any specific derivative of GenericReactor (which actually implements our react() function) a SpecificReactor:

  class SpecificReactor : public GenericReactor {
    void react(const Event& ev) override;
  };

In addition, let’s observe that whenever the (Re)Actor needs to communicate with another (Re)Actor then – adhering to the ‘Do not communicate by sharing memory; instead, share memory by communicating’ principle – it merely sends a message, and it is only this message which will be shared between (Re)Actors. In turn, this means that we can (and should) use single-threaded allocation for all (Re)Actor purposes – except for allocation of those messages intended for inter-(Re)Actor communications.

Task: Serialization for the same executable

Now, let’s define what we’re going to do with our (Re)Actor in this article (and why). Basically, as discussed in [Ignatchenko-Ivanchykhin16], when dealing with (Re)Actors, we often need to serialize the current state of our (Re)Actor so that we can deserialize it in exactly the same executable (though this executable can run on a completely different computer, etc.). Applications of such ‘serialization for exactly the same executable’ are numerous, including (Re)Actor migration, fault tolerance, and post-factum production debugging (see, for example, [NoBugs17c]).

‘Speedy Gonzales’ (Re)Actor-level serialization

Now, as we have both our (Re)Actors and our task at hand more-or-less defined, let’s see how well we can do with our per-(Re)Actor allocator.

Actually, it is fairly simple: to serialize, we take the pages belonging to our per-(Re)Actor allocator (which, by construction, contain the whole state of our (Re)Actor) and dump them as-is – essentially a memcpy() of a few large blocks of RAM. To deserialize, we load those pages back; if they cannot be placed at exactly their original addresses, we traverse the whole state of our (Re)Actor and update each pointer to point to its new, relocated target (along the lines of the relocation support discussed in [NoBugs17a]).

Bingo! After such a traversal of the whole state of our (Re)Actor is completed, we can be sure that all the pointers to the heap within our (Re)Actor have been updated with the new values. In turn, this means that all the relocations due to deserialization have been handled, and we can proceed with normal (Re)Actor processing.
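
For illustration, the serializing side might look along the following lines (a sketch only: the pages() accessor and the .addr/.size fields are assumptions made for this example, and the real interface of your allocator will differ):

  #include <cstring>
  #include <vector>

  // Illustrative only: assumes the per-(Re)Actor allocator (here called
  // OurAllocator, as in Listing 2 below) can enumerate its pages.
  struct SerializedPage {
    void* original_addr;     // address the page occupied when dumped
    std::vector<char> bytes; // raw page contents
  };

  std::vector<SerializedPage> serialize_state(const OurAllocator& alloc) {
    std::vector<SerializedPage> out;
    for(const auto& page : alloc.pages()) { // hypothetical accessor
      SerializedPage sp;
      sp.original_addr = page.addr;
      sp.bytes.resize(page.size);
      std::memcpy(sp.bytes.data(), page.addr, page.size); // the whole 'serialization'
      out.push_back(std::move(sp));
    }
    return out; // deserialization loads the pages back and, if the
                // addresses changed, fixes the pointers up as described above
  }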

One very important property of such serialization is that it is extremely difficult to beat performance-wise. Deserialization is a slightly different story due to potential relocation, but it is still pretty fast. Also, for several of the use cases mentioned in the ‘Task: Serialization for the same executable’ section above, it is only the performance of serialization which really matters. Indeed, all we need to do is memcpy() large chunks of RAM, and with memcpy() speeds being of the order of 5-10 gigabytes/second, at least for x64 (see, for example, [B14]), even serializing a (Re)Actor with a 100MB state takes only of the order of 10-20ms. Serializing the same thing using conventional serialization methods (even really fast ones, such as the one discussed in [Ignatchenko-Ivanchykhin16]) is going to be significantly slower. The exact numbers depend on the specifics of the organization of the data, but if we have a randomly filled 100MB std::map<int, int>, just iterating over it, without serializing anything, already takes of the order of 100ms – almost an order of magnitude longer(!).

Parallel serialization aided by (kinda-)copy-on-write

For those apps where even 10-20ms of additional latency per 100MB of state is too much, it might be possible to reduce it further.

One of the implementations would work along the following lines (ideologically similar to, though not identical to, classic Copy-on-Write):

When we want to serialize, we (in our ‘main’ (Re)Actor processing thread) mark all the pages of our (Re)Actor state as ‘no-write’, put all of them onto a list of pages to be serialized, and immediately return to normal event processing.

The ‘serialization’ thread just takes pages from the to-be-serialized list one by one, and for each such page: atomically marks it as serialized (removing it from the list), copies its contents into the serialization buffer, and re-enables write access to the page.

Whenever we attempt to write to one of the ‘no-write’ pages, we catch the appropriate CPU-level exception, and within the exception handler:

Check whether the page being accessed has already been serialized (this can happen due to races with the ‘serialization’ thread); this check should be done in an atomic manner, just like in the ‘serialization’ thread as described above

If the page isn’t serialized yet: copy it into the serialization buffer ourselves (atomically removing it from the to-be-serialized list), re-enable write access to the page, and resume the interrupted write

That’s it. While with such processing we do have to copy some of the pages within the main thread (causing some latency), for typical access patterns this will happen relatively rarely, significantly reducing the overall serialization latency observed within the ‘main’ (Re)Actor thread. For example, if out of our 100MB (~25K 4KB pages) of (Re)Actor state, only 1000 pages are modified during our 20ms serialization, then the latency cost of the serialization will drop by a factor of roughly 25x, bringing observable latency to around 1ms (which is acceptable for the vast majority of systems out there, even for first-person-shooter games).
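
To make the scheme above a bit more tangible, here is a heavily simplified POSIX-style sketch (all the names are illustrative, and a production version needs to be much more careful, in particular about async-signal-safety within the handler):

  #include <atomic>
  #include <csignal>
  #include <cstddef>
  #include <cstdint>
  #include <cstring>
  #include <sys/mman.h>

  constexpr std::size_t PAGE = 4096;

  struct PageEntry {
    char* addr;                    // start of one page of (Re)Actor state
    std::atomic<bool> done{false}; // already copied to the snapshot?
  };
  extern PageEntry g_pages[];      // all pages of the (Re)Actor state
  extern std::size_t g_num_pages;
  extern char* g_snapshot;         // pre-allocated serialization buffer

  // Atomically claim a page and copy it; both the 'serialization' thread
  // and the fault handler funnel through here, so each page is copied
  // exactly once regardless of who wins the race.
  void serialize_page(std::size_t i) {
    bool expected = false;
    if(!g_pages[i].done.compare_exchange_strong(expected, true))
      return; // the other thread already took care of this page
    std::memcpy(g_snapshot + i * PAGE, g_pages[i].addr, PAGE);
    mprotect(g_pages[i].addr, PAGE, PROT_READ | PROT_WRITE); // writes OK again
  }

  // 'serialization' thread: walks all the pages in the background
  void serialization_thread() {
    for(std::size_t i = 0; i < g_num_pages; ++i)
      serialize_page(i);
  }

  // SIGSEGV handler (installed via sigaction() with SA_SIGINFO): a write
  // hit a 'no-write' page; copy it out and let the faulting write re-run.
  void on_write_fault(int, siginfo_t* si, void*) {
    auto fault = reinterpret_cast<std::uintptr_t>(si->si_addr) & ~(PAGE - 1);
    for(std::size_t i = 0; i < g_num_pages; ++i) // sketch: linear lookup
      if(reinterpret_cast<std::uintptr_t>(g_pages[i].addr) == fault)
        serialize_page(i);
  }

To start a snapshot, the ‘main’ thread would reset the done flags, mprotect() all the pages to PROT_READ, and only then hand the list over to the ‘serialization’ thread.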

Per-(Re)Actor allocators in a usable manner

Up to now, we have merely been assuming that our allocators can be made per-(Re)Actor; one obvious way of doing this is to have our (Re)Actor code specify our own allocator explicitly for each and every allocation within our (Re)Actor (covering both explicit calls to new and all implicit allocations, such as those made inside collections).
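
Just to illustrate what this would mean in practice, here is a sketch of the kind of clutter involved (PerReactorStdAllocator is a hypothetical std-style adapter written for this example; OurAllocator is the per-(Re)Actor allocator class which will appear in Listing 2, with free() assumed alongside its malloc()):

  #include <cstddef>
  #include <vector>

  // Hypothetical std-style adapter over a per-(Re)Actor allocator
  template<class T>
  struct PerReactorStdAllocator {
    using value_type = T;
    OurAllocator* a; // the per-(Re)Actor allocator
    explicit PerReactorStdAllocator(OurAllocator* a_) : a(a_) {}
    template<class U>
    PerReactorStdAllocator(const PerReactorStdAllocator<U>& other)
      : a(other.a) {}
    T* allocate(std::size_t n) {
      return static_cast<T*>(a->malloc(n * sizeof(T)));
    }
    void deallocate(T* p, std::size_t) { a->free(p); }
  };
  template<class T, class U>
  bool operator==(const PerReactorStdAllocator<T>& x,
                  const PerReactorStdAllocator<U>& y) { return x.a == y.a; }
  template<class T, class U>
  bool operator!=(const PerReactorStdAllocator<T>& x,
                  const PerReactorStdAllocator<U>& y) { return !(x == y); }

  // ...and then each and every container within the (Re)Actor has to
  // carry the allocator around explicitly (my_allocator is hypothetical):
  std::vector<int, PerReactorStdAllocator<int>>
      v{PerReactorStdAllocator<int>(my_allocator)};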

While such a naïve approach would work in theory, it is way too inconvenient to be used in practice. Fortunately, changing an allocator to a per-(Re)Actor one happens to be possible without any changes to the (Re)Actor code. In particular, it can be done along the following lines.

First, we replace malloc()/free(). (Important: make sure that your global ::operator new/::operator delete and your default std::allocator also use the replaced functions(!). The latter might be rather tricky unless your std library already uses ::operator new()/::operator delete(), but usually it can be taken care of; in particular, for GCC, see [GCC] and the --enable-libstdcxx-allocator option for ./configure of libstdc++.)
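
For the global operators, a minimal sketch might look like this (simplified: real code must also cover the nothrow, aligned, sized, and array overloads, and respect std::new_handler):

  #include <cstdlib>
  #include <new>

  // Route all global allocations through the replaced malloc()/free()
  void* operator new(std::size_t sz) {
    if(void* p = malloc(sz)) // the replaced malloc()
      return p;
    throw std::bad_alloc{};
  }
  void operator delete(void* p) noexcept {
    free(p);                 // the replaced free()
  }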

To implement our own malloc(), we’re going along the lines of Listing 2. (Of course, free() should go along the same lines.)

thread_local OurAllocator* current_allocator 
  = nullptr;

void* malloc(size_t sz) {
  if( current_allocator )
    return current_allocator->malloc(sz);
  else
    return non_reactor_malloc(sz);
}
Listing 2

The point here is that our Infrastructure Code (the one which calls our (Re)Actor) sets the current_allocator pointer before every call to GenericReactor::react() (see Listing 3).

current_allocator = create_new_allocator();
std::unique_ptr<GenericReactor> r
  = reactorFactory.createReactor
  (current_allocator,...);
current_allocator = nullptr;

while(true) { //event loop
  Event ev = get_event();
    //from select(), libuv, ...
  current_allocator = r->allocator;
  r->react(ev);
  current_allocator = nullptr;
}

current_allocator = r->allocator;
r.reset();
current_allocator = nullptr;
Listing 3

Of course, this is a kind of trick – but it will work. Very briefly: first, we confine our current_allocator variable to the current thread by using thread_local, and then, within this single thread, we can easily control which allocator is currently in use by simple assignments within our Infrastructure Code. One thing to remember when using this approach is to make sure that we set current_allocator before each and every method call into our (Re)Actor (including its constructor and destructor(!)).
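
One possible way to make this rule harder to violate (a refinement of mine, not something required by the technique) is to wrap the assignment into a small RAII guard and use it around every call into the (Re)Actor:

  // Sets current_allocator on construction and restores the previous
  // value on destruction, even if react() throws.
  class AllocatorGuard {
    OurAllocator* saved;
  public:
    explicit AllocatorGuard(OurAllocator* a) : saved(current_allocator) {
      current_allocator = a;
    }
    ~AllocatorGuard() { current_allocator = saved; }
    AllocatorGuard(const AllocatorGuard&) = delete;
    AllocatorGuard& operator=(const AllocatorGuard&) = delete;
  };

  // usage within the event loop of Listing 3:
  //   AllocatorGuard guard(r->allocator);
  //   r->react(ev);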

That’s it: we’ve made our (Re)Actor use a per-(Re)Actor allocator – and without changing a single line within our (Re)Actor’s code too ☺.

Summary

As this part III concludes the mini-series on ‘Allocators for (Re)Actors’, let’s summarize our findings from all three parts.

Part I

Allocators tuned for (Re)Actor purposes can provide optional kinda-safety and support relocation of the (Re)Actor state [NoBugs17a].

Part II

By adding a few more easily understood guidelines, we can extend our ‘kinda-Safe’ mode from Part I into a ‘really safe’ C++ dialect.

All the guidelines/rules we need to follow are local, which enables reasonable tool-aided enforcement and allows us to keep the code maintainable.

Part III

Per-(Re)Actor allocators allow us to serialize the state of a (Re)Actor for deserialization by the same executable at close-to-memcpy() speed, which is extremely difficult to beat; where even that is too slow, the latency observable within the ‘main’ (Re)Actor thread can be reduced further via (kinda-)copy-on-write; and all of this can be achieved without changing a single line of the (Re)Actor’s code.

Phew. It was a rather long mini-series, but I hope I have made my points about the significant advantages of allocators specialized for (Re)Actor purposes reasonably clear.

References

[B14] Thomas B, ‘Single-threaded memory performance for dual socket Xeon E5-* systems’, https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/509237

[GCC] ‘The GNU C++ Library Manual’, Chapter 6, https://gcc.gnu.org/onlinedocs/libstdc++/manual/memory.html#allocator.ext

[Ignatchenko-Ivanchykhin16] Sergey Ignatchenko and Dmytro Ivanchykhin, ‘Ultra-fast Serialization of C++ Objects’, Overload #136, Dec 2016

[Loganberry04] David ‘Loganberry’, Frithaes! – an Introduction to Colloquial Lapine!, http://bitsnbobstones.watershipdown.org/lapine/overview.html

[NoBugs17a] ‘No Bugs’ Hare, ‘Allocator for (Re)Actors with Optional Kinda-Safety and Relocation’, Overload #139, Jun 2017

[NoBugs17b] ‘No Bugs’ Hare, ‘A Usable C++ Dialect that is Safe Against Memory Corruption’, Overload #140, Aug 2017

[NoBugs17c] ‘No Bugs’ Hare, ‘Deterministic Components for Interactive Distributed Systems’, ACCU2017, available at http://ithare.com/deterministic-components-for-interactive-distributed-systems-with-transcript/
