Overload Journal #143 - February 2018

Title: An MWSR Queue with Minimalist Locking

Author: Sergey Ignatchenko

Date: Wed, 07 February 2018 16:45:17 +00:00

Summary: Multithreaded queues come in many flavours. Sergey Ignatchenko describes his implementation of a multiple writer single reader queue.

Body: 

Disclaimer: as usual, the opinions within this article are those of ‘No Bugs’ Hare, and do not necessarily coincide with the opinions of the translators and Overload editors; also, please keep in mind that translation difficulties from Lapine (like those described in [Loganberry04]) might have prevented an exact translation. In addition, the translator and Overload expressly disclaim all responsibility from any action or inaction resulting from reading this article.

In [NoBugs17], we discussed the theory behind using CAS (Re)Actors to build multithreaded non-blocking primitives. A very brief recap: a CAS (Re)Actor is a (Re)Actor whose entire state fits into a single atomically-CAS-able block (at most 128 bits on modern x64 CPUs), so that each operation on it boils down to a loop of ‘read the current state, compute the new state, and publish it with a single compare-and-swap, retrying if the CAS fails’.

As was mentioned in [NoBugs17], the benefit provided by CAS (Re)Actors is not about the sequence of CPU operations we’re issuing; in theory, exactly the same thing can be produced without (Re)Actors at all. The key benefit is about how we’re thinking about our multithreaded primitive, which tends to provide significant benefits in the complexity that we can handle. Today we’ll demonstrate a practical use of CAS (Re)Actors using one very specific example.
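To make this recap more concrete, here is a minimal sketch (with a placeholder state type, not the actual code from [NoBugs18]) of the read-modify-CAS loop which every CAS (Re)Actor operation boils down to:

#include <atomic>
#include <cstdint>

//placeholder 128-bit (Re)Actor state; only its
//  combined size really matters
struct State {
  uint64_t lo;
  uint64_t hi;
};

//every (Re)Actor operation follows this shape,
//  with a different pure function computing
//  the new state
template<class Modify>
State reactorOp(std::atomic<State>& cell,
                Modify modify) {
  State expected = cell.load();
  for (;;) {
    State desired = modify(expected);//pure
    if (cell.compare_exchange_weak(expected,
                                   desired))
      return desired;//published atomically
    //on failure, 'expected' was reloaded by
    //  compare_exchange_weak(); just retry
  }
}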

Optimistic concurrency control

OCC assumes that multiple transactions can frequently complete without interfering with each other. While running, transactions use data resources without acquiring locks on those resources. Before committing, each transaction verifies that no other transaction has modified the data it has read. [Wikipedia]

The task at hand

In quite a few rather serious real-world interactive distributed systems (such as stock exchanges and games), I found myself in need of a really fast queue, which had to have the following characteristics:

- it is a multiple-writer single-reader (MWSR) queue;
- it provides flow control (i.e. writers are blocked while the queue is full);
- it locks only when strictly necessary, i.e. only when the queue is empty or full;
- it is at least almost-fair to the writers.

As much as I was in need of such a queue, implementing it turned out to be an extremely difficult task, and I wasn’t able to devise a design which avoids all the races. I did try three or four times, but each time I found myself in a vicious circle of solving one race merely to create another, staying in this ‘robbing Peter to pay Paul’ mode until I realized the futility of my efforts in that direction :-(. It was even worse because I was sure that a solution did exist – it was just that I wasn’t able to find it.

Enter CAS (Re)Actors

After several attempts at it, writing this queue-with-flow-control became a kind of personal obsession of mine, so it is no surprise that last year I took another shot at it. And as by that time I was spending quite a bit of time formalizing and generalizing my experiences with (Re)Actors, the idea of CAS-size (Re)Actors fortunately came to my mind. Let’s see how CAS (Re)Actors allowed me to write that elusive MWSR queue with flow control (we have to be sketchy here, but the whole supposedly-working implementation is available on GitHub [NoBugs18]).

(Re)Actors

Let’s say that our queue is based on two CAS (Re)Actors: EntranceReactor, which hands out slot IDs to the writers (and knows when the queue is full), and ExitReactor, which keeps track of completed writes and hands finished slots over to the reader (and knows when the queue is empty).

For the sizes of the fields, please refer to [NoBugs18]; however, it should be noted that, under the CAS (Re)Actors paradigm, we can easily use bit fields, so the only thing we care about is the combined bit size of the fields, which shouldn’t exceed the magic number of 128 (that is, for modern x64 CPUs which support the CMPXCHG16B instruction – and that is pretty much any Intel/AMD CPU produced over the last 10 years or so).
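As an illustration (with hypothetical field names, not the actual layout from [NoBugs18]), packing a (Re)Actor state into a single CAS-able 128-bit block might look along the following lines:

#include <atomic>
#include <cstdint>

//hypothetical layout; all that matters is that
//  the combined bit size stays within 128 bits
struct EntranceState {
  uint64_t firstIDToWrite : 63;//bit fields can be
  uint64_t locked         : 1; //  sliced as needed,
  uint64_t lastIDToWrite  : 63;//  as long as the
  uint64_t reserved       : 1; //  total is <= 128
};
static_assert(sizeof(EntranceState) == 16,
  "state must fit into one 128-bit CAS block");

//NB: to get an actually lock-free
//  std::atomic<EntranceState>, the compiler must
//  be allowed to emit CMPXCHG16B (e.g. -mcx16
//  for gcc/clang on x64)
std::atomic<EntranceState> entrance_;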

On the ABA problem

As is well known in MT programming, and was briefly discussed in [NoBugs17], the so-called ABA problem is one of those things which can easily kill the correctness of a multithreaded primitive. However, it seems that our (Re)Actors are free from ABA-related issues; very briefly: the IDs stored within the (Re)Actor states are monotonically increasing 64-bit values, so for all practical purposes a state can never become bitwise-identical to an earlier one (even at a billion increments per second, a 64-bit counter takes over 500 years to wrap around).

Overall, from what I can see, our (Re)Actors are free of the ABA problem; still, as the ABA problem is one of those bugs which can sit there for ages before manifesting itself, I would certainly appreciate somebody more skilled than me having another look at it.

Locking primitives

In addition to (Re)Actors, we have two locking primitives, both built more or less along the lines of Listing 1.

class LockedSingleThread {
private:
  int lockCount = 0;//MAY be both >0 and <0;
    // -1 means unlock() arrived before lockAndWait()
  std::mutex mx;
  std::condition_variable cv;
public:
  void lockAndWait() {
    std::unique_lock<std::mutex> lock(mx);
    assert(lockCount == -1 || lockCount == 0);
    lockCount++;//a pre-arrived unlock() (-1)
                //  cancels out here
    while (lockCount > 0) {//also guards against
                           //  spurious wakeups
      cv.wait(lock);
    }
  }
  void unlock() {
    std::unique_lock<std::mutex> lock(mx);
    lockCount--;//may become -1 if we run
                //  before lockAndWait()
    lock.unlock();//avoid waking the waiter
                  //  just to block on mx again
    cv.notify_one();
  }
};
Listing 1

It is a rather simple (but quite interesting) primitive. The idea is that whenever one of our (Re)Actor operations reports that the calling thread should be locked, that thread calls lockAndWait() and waits on a condition variable until some other thread calls unlock(). It is important to note that our locking primitives must unlock properly regardless of potential races between the thread being locked and unlock(); in other words, they must work whether unlock() comes before or after the to-be-locked thread reaches lockAndWait(). This is exactly what allowing lockCount to go negative achieves: if unlock() arrives first, lockCount becomes -1, the subsequent lockAndWait() brings it back to 0, and the wait loop is skipped entirely.
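As a minimal usage sketch (assuming the LockedSingleThread class from Listing 1 is in scope), both possible interleavings below terminate:

#include <thread>
// ... plus LockedSingleThread from Listing 1 ...

int main() {
  LockedSingleThread gate;
  //unlock() may run before or after the main
  //  thread blocks; thanks to the lockCount
  //  bookkeeping, both orderings terminate
  std::thread t([&gate] { gate.unlock(); });
  gate.lockAndWait();
  t.join();
  return 0;
}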

Putting it all together

Having all four building blocks (two (Re)Actors and two locking primitives), we can write our MWSRQueue (see Listing 2).

template<class QueueItem>
class MWSRQueue {
  static constexpr size_t QueueSize = 64;

private:
  QueueItem items[QueueSize];
  MT_CAS entrance;//state of EntranceReactor
  MWSRQueueFC_helpers::LockedThreadsList
    lockedWriters;//writers waiting on full queue
  MT_CAS exit;//state of ExitReactor
  MWSRQueueFC_helpers::LockedSingleThread
    lockedReader;//reader waiting on empty queue

public:
  MWSRQueue();
  void push(QueueItem&& item) {
    EntranceReactorHandle ent(entrance);
    std::pair<bool, uint64_t> ok_id =
      ent.allocateNextID();
    if (ok_id.first) {//queue is full: wait
                      //  until our slot is freed
      lockedWriters.lockAndWait(ok_id.second);
      ent.unlock();
    }
    size_t idx = index(ok_id.second);
    items[idx] = std::move(item);
    ExitReactorHandle ex(exit);
    bool unlock =
      ex.writeCompleted(ok_id.second);
    if (unlock)//reader was waiting on
               //  a previously-empty queue
      lockedReader.unlock();
  }
  QueueItem pop() {
    while (true) {
      ExitReactorHandle ex(exit);
      std::pair<size_t, uint64_t> sz_id =
        ex.startRead();
      size_t sz = sz_id.first;
      assert(sz <= QueueSize);
      if (!sz) {//queue is empty
        lockedReader.lockAndWait();
        // unlocking ex is done by
        // ex.writeCompleted()
        continue;//while(true)
      }
      uint64_t id = sz_id.second;
      size_t idx = index(id);
      QueueItem ret = std::move(items[idx]);
      uint64_t newLastW =
        ex.readCompleted(sz, id);
      EntranceReactorHandle ent(entrance);
      bool shouldUnlock =
        ent.moveLastToWrite(newLastW);
      if (shouldUnlock)//wake writers waiting
                       //  for freed slots
        lockedWriters.unlockAllUpTo(id +
          sz - 1 + QueueSize);
      return ret;
    } //while(true)
  }
private:
  size_t index(uint64_t i) {
    return i % QueueSize;//compiles to a mere mask
             // as long as QueueSize is power of 2
  }
};
Listing 2
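As a usage sketch (assuming the helper types from [NoBugs18] are available, so that Listing 2 actually compiles), several writers and a single reader could drive the queue like this:

#include <thread>
#include <vector>
// ... plus MWSRQueue from Listing 2 and its
//   helpers from [NoBugs18] ...

int main() {
  MWSRQueue<int> q;
  std::vector<std::thread> writers;
  for (int w = 0; w < 4; ++w)//multiple writers...
    writers.emplace_back([&q, w] {
      for (int i = 0; i < 1000; ++i)
        q.push(w * 1000 + i);//blocks only while
                             //  the queue is full
    });
  for (int i = 0; i < 4000; ++i)
    q.pop();//...single reader; blocks only
            //  while the queue is empty
  for (auto& t : writers) t.join();
  return 0;
}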

As we can see, after we have defined our (Re)Actors (including their operations), the whole thing is fairly simple. Within our push() function, we merely:

- ask EntranceReactor to allocate an ID for the slot we’re going to write to;
- if the queue is full, wait on lockedWriters until our slot becomes available;
- move the item into its slot;
- inform ExitReactor that the write has completed; and
- if ExitReactor reports that the reader is waiting on a previously-empty queue, unlock it.

As for our pop() function, it is only marginally more complicated:

- ask ExitReactor whether there are completed writes waiting to be read;
- if the queue is empty, wait on lockedReader until some writer’s writeCompleted() unlocks us, and retry;
- otherwise, move the item out of its slot;
- inform ExitReactor that the read has completed; and
- pass the new ‘last writable’ position on to EntranceReactor, unlocking any writers which were waiting for slots to be freed.

That’s it! We’ve got our MWSRQueue, and with all the desired properties too. In particular, it is an MWSR queue, it does provide flow control, it locks only when necessary (on the queue being empty or full), and it is almost-fair (as IDs are assigned in the very first call to allocateNextID(), the worst unfairness which can possibly happen is limited to CAS retries, which are never long in practice).

However, IMNSHO the most important property of the queue is that it was observed to be easily debuggable. After I finished writing the code (around 700 LoC of heavily multithreaded code, which is next-to-impossible to test until the whole thing is completed) and ran the simplistic tests found in the /test/ folder within [NoBugs18], there were, of course, bugs (about a dozen of them). For multithreaded programs in general, debugging is a well-known nightmare, in particular because (a) a bug manifests itself in a different place on different runs, and (b) adding tracing can easily change the timings so much that the bug won’t manifest itself anymore (!). This specific queue, however, happened to be very easy to debug, not least because the entire state of each (Re)Actor is a single 128-bit value which changes only at well-defined CAS points, so the states could simply be logged and examined after the fact. I was able to debug it within half a day.

I contend that anybody who has tried to debug multithreaded programs of comparable complexity will realize how fast half a day is for this kind of not-so-trivial multithreading.

Maintainability

After it started to work, I ran some experiments and found that with one single writer it performed great: with real 128-bit CAS, I measured an upper bound on the performance of this queue at about 130 nanoseconds per push()+pop() pair. However, with more than one writer, performance degraded very severely (by a factor of around 50 (!)).
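For reference, the kind of single-threaded micro-benchmark which yields such a per-pair number might look as follows (a sketch, not the actual harness from [NoBugs18]; with one writer and an interleaved reader, the queue never becomes full or empty, so no locking occurs):

#include <chrono>
#include <cstdio>
// ... plus MWSRQueue from Listing 2 ...

int main() {
  MWSRQueue<int> q;
  constexpr int N = 1000000;
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < N; ++i) {
    q.push(int(i));//queue never becomes full...
    q.pop();//...or empty, so the fast path is
            //  exercised on every iteration
  }
  auto stop = std::chrono::steady_clock::now();
  std::printf("%.1f ns per push()+pop() pair\n",
    std::chrono::duration<double, std::nano>(
      stop - start).count() / N);
  return 0;
}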

After thinking about it for a few hours, I realized that the code shown above can actually be improved a lot – in particular, we can (and should) avoid going into thread-sync on each and every call to pop(). Indeed, as our queue holds up to 64 slots, we can read all the currently-completed ones into some kind of ‘read cache’ (with proper synchronization), and then serve subsequent calls to pop() from the cached values without any thread sync involved. This optimization improved performance in tests with two writers by over 50x (so that performance with two writers became about the same as with one single writer). BTW, if you want to see the code with this ‘read cache’ optimization, it is part of the current implementation in [NoBugs18].
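A minimal sketch of the ‘read cache’ idea (with a hypothetical bulk-read helper, not the actual code from [NoBugs18]):

#include <deque>
#include <utility>

//hypothetical single-reader wrapper; drainAll()
//  stands in for a bulk version of pop() which
//  moves all currently-completed items out under
//  a single round of synchronization
template<class QueueItem, class Queue>
class CachedReader {
  Queue& q;
  std::deque<QueueItem> cache;//touched only by
                              //  the reader thread
public:
  explicit CachedReader(Queue& q_) : q(q_) {}
  QueueItem pop() {
    if (cache.empty())
      q.drainAll(cache);//one thread-sync for up
                        //  to QueueSize items
    QueueItem ret = std::move(cache.front());
    cache.pop_front();//fast path: no thread
                      //  sync at all
    return ret;
  }
};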

However, my main point here is not about the performance of this particular queue. What I want to emphasize is that such a significant optimization turned out to be easy to make: as all the thread-sync logic is encapsulated within the (Re)Actors and the two locking primitives, the change stayed local, and did not require revisiting the correctness of the queue as a whole.

Of course, a lot of further optimizations are possible for this queue (in particular, I am thinking of introducing ‘write caches’ along the lines of the ‘read cache’ above); still, even the current (not perfectly optimized) version in [NoBugs18] seems to perform pretty well under close-to-real-world usage patterns. On the other hand, please treat the code in [NoBugs18] as highly experimental, and be sure to test it very thoroughly before using it in any kind of production; multithreading bugs are sneaky, and there is always a chance that one of them did manage to hide within, in spite of all the reliability improvements provided by CAS (Re)Actors.

Conclusions

We have demonstrated how the real-world task of ‘creating an MWSR queue with flow control and minimal locking’ can be implemented using the concept of CAS (Re)Actors (which was discussed in detail in [NoBugs17]).

In the process, it was also observed that:

Not only do CAS (Re)Actors allow us to write multithreaded programs very easily (by the standards of multithreaded programs, that is), but CAS (Re)Actor-based programs are also easily maintainable and easily optimizable.

As a nice side effect ;-), we also wrote a practically-usable MWSR queue with flow control and minimalistic locking, which can take as little as 120 nanoseconds per push()+pop() pair :-).

References

[Loganberry04] David ‘Loganberry’, Frithaes! – an Introduction to Colloquial Lapine!, http://bitsnbobstones.watershipdown.org/lapine/overview.html

[NoBugs17] ‘No Bugs’ Hare, CAS (Re)Actor for Non-Blocking Multithreaded Primitives, Overload #142, December 2017

[NoBugs18] ‘No Bugs’ Hare, mtprimitives, https://github.com/ITHare/mtprimitives/tree/master/src

[Wikipedia] ‘Optimistic concurrency control’, https://en.wikipedia.org/wiki/Optimistic_concurrency_control

Acknowledgement

Cartoon by Sergey Gordeev from Gordeev Animation Graphics, Prague
