Journal Articles

Overload Journal #133 - June 2016 + Programming Topics
Browse in : All > Journals > Overload > o133 (8)
All > Topics > Programming (877)
Any of these categories - All of these categories

Note: when you create a new publication type, the articles module will automatically use the templates user-display-[publicationtype].xt and user-summary-[publicationtype].xt. If those templates do not exist when you try to preview or display a new article, you'll get this warning :-) Please place your own templates in themes/yourtheme/modules/articles . The templates will get the extension .xt there.

Title: Deterministic Components for Distributed Systems

Author: Martin Moene

Date: 04 June 2016 14:32:53 +01:00 or Sat, 04 June 2016 14:32:53 +01:00

Summary: Non-deterministic data leads to unstable tests. Sergey Ignatchenko considers when this happens and how to avoid it.

Body: 

Over the last 10 years, significant improvements have been made in program testing. I don’t want to go into a lengthy argument whether TDD is good or bad here – this is not the point of this article. However, one thing I want to note is a rather obvious (so obvious that it is often taken as granted) point that

For test results to have any meaning, the test needs to be reproducible

In theory, we can speak of ‘statistical testing’ when the program is run for 1000 times (and if it fails not more than x times out of these 1000 runs, it is considered fine), but for most of the programs out there it would be a Really Bad Way to test them.

So far so obvious, but what does this really mean in practice? Let’s consider the relationship between two subtly different properties: the ‘test being reproducible’ and the ‘program being deterministic’ (defined as ‘the program has all its outputs completely defined by its inputs’). For the time being, we’ll refrain from answering the question ‘what qualifies as an input’; we’ll come to discussing it a bit later.

First, let’s note that for a 100% deterministic program, all the tests are always reproducible. In other words, a program being deterministic is sufficient to make all the tests against this program reproducible. On the other hand, if all the possible tests for our program are always reproducible, it means that our program is deterministic.

However, there are tests out there which are reproducible even when the program is not entirely deterministic. For example, if our program X prints the current time when we specify the -t option, and it emits concatenation of the input string with ‘beyond any repair’ otherwise, we can say (assuming that we don’t consider current time as one of the program inputs) that:

Let’s see what we have observed up to now:

The second statement can be seen as an indication that all is not lost – we can have meaningful tests for non-deterministic programs. And this does stand; on the other hand, a much more unpleasant statement stands too:

For a real-world non-deterministic program, we are bound to leave some functionality untested

Let’s see this on an example of our program X. We certainly can have a set of reproducible tests over our program – the ones which don’t use the -t option. However, this will leave us with a significant part of our functionality untested.

To avoid this, we can try to test our program with the -t option to emit the right format, but wait – without meddling with the system time we won’t even be able to test it on Feb 29th etc. Ok, we can add manipulating system time to our test, but even then we’ll be testing only the time format (and not correctness of the time which was printed). We may try to measure current time before calling our program, and to compare this value with that printed by our program X; it might seem a good solution, but apparently, even if the program prints time with minute precision and runs for only 0.1 second, from time to time the test will fail. More specifically, such a test will fail whenever this 0.1 second (between the moment when we’ve measured current time and the moment our program has made its own measurement) happens to occur exactly across the boundary between two different minutes. Ok, we could add tolerance (which BTW already makes our test imprecise) and consider the program valid if the printed output printed_time satisfies an equation measured_time < printed_time < measured_time + 0.1s, but even this is not sufficient, as we haven’t accounted for potential system time adjustments (including automated ones).

As we can see, even for an absolutely trivial non-deterministic program, accurate testing becomes a very non-trivial task. In practice, for a non-trivial program writing a set of tests which cover a significant part of the functionality quickly becomes a daunting task (or, more often, leads to significant functionality left untested). And if our program is traditionally multithreaded (defined as ‘multithreaded with explicit thread sync using mutexes or equivalent’) testing very quickly becomes pretty much pointless exactly because of the lack of determinism: the program which works perfectly on your test rig can easily fail on some other computer (or fail on the same computer from time to time) – just because whoever runs it was unlucky. Bummer.

Distributed programs and unit-tests: false sense of security

Let’s come back from our so-far-theoretical clouds to the earth of programming. Let me tell you one real-world story in this regard (let’s call it ‘Unit Test Horror Story’ for future reference).

Once upon a time, there was a company which successfully ran a multi-million-dollar online business. Moreover, they enjoyed very little unplanned downtime by their industry standards (1 hour per year or so). Then, on a nice sunny day (or a dark cloudy one – it won’t change anything), a new developer came to the company. It just so happened that he was a strong proponent of TDD, so he started writing a unit-test for a new feature, and the test failed; he implemented the feature, and the test succeeded. And then the new developer read a lecture to all those non-TDD developers, saying ‘Now you see how easy it is to write error-free programs!’ Next day, his changes went to production and caused one of those once-per-year downtimes.

The moral of this story is certainly not along the lines of ‘see, TDD is useless’ (IMHO, TDD, when taken in moderation, is quite useful, though it is far from being a silver bullet). The point here is to understand what has caused the downtime; and apparently, it was an unusual sequence of otherwise perfectly valid messages (sometimes such unusual sequences are referred to as ‘races’).

Moreover, from my experience, these

unusual sequences of otherwise perfectly valid messages are by far The Most Difficult To Find Bugs in pretty much any real-world distributed system.

As a consequence (and combined with an observation that such unusual sequences tend to be among the most unexpected for the developer), it means that

Unit tests, while important, are insufficient to ensure the validity of a distributed system

This happens because even if every component of your program is deterministic, when you make a distributed system out of these components, your distributed system (taken as a whole) is very unlikely to be deterministic. However, as we’ll see below, even per-component determinism helps a lot for debugging such distributed systems. This includes things such as automated replay-based regression testing, low-latency fault-tolerance and relocation for components, and the holy grail of production post-mortem analysis.

Benefits of deterministic components

Let’s discuss the major benefits of deterministic components in a distributed system.

Replay-based regression testing

As we’ve seen in the ‘Unit Test Horror Story’ above, unit tests are not sufficient to test components of the distributed system. On the other hand, if our component is entirely deterministic, we can do the following:

Of course, you’ve already noticed the weak point of this kind of testing: as noted above, it will work only as long as you have made no changes to existing functionality. This indeed is one of the reasons why this kind of testing is not a silver bullet. However, two things help in this regard:

Deterministic production post-mortem

However thoroughly we test our systems, from time to time they do fail . In such cases, it is of utmost importance to identify the problem ASAP, and to fix it, so it doesn’t happen again.

In this regard, deterministic components can provide a Holy Grail of production post-mortem: an ability to reproduce the bug exactly as it has happened in production. If our component is deterministic, then:

In reality, of course, it is not that simple. If our system runs for months, storing and replaying all the inputs which have happened during those months is rarely feasible. However, we can avoid this problem by running our input log in a circular manner.

As mentioned above, running input log for the entire history of our component rarely qualifies as a viable option. On the other hand, if our component is deterministic, and we can get current state of our component in some kind of serialized form, then we can build our input log as follows:

Such circular input logs are highly practical at least in some deployment scenarios, and provide an ability to see the last t seconds of life of a component before it failed. While it might be that the problem occurred before these last t seconds, and was only manifested later, this technique still happens to be quite useful in practice.

Low-latency fault tolerance and component relocation

Yet another major benefit of having your components both deterministic and serializable is that it is possible to gain some fault tolerance from them (without any help from application level code). In particular, the following schema will work:

Then you can recover from any single server failure in a perfectly transparent manner:

This kind of determinism-based fault tolerance provides fault-tolerance with very little additional latency. In fact, it is ideologically similar to the virtual lockstep way of achieving fault tolerance (and provides significantly better latency than so-called fast checkpoints).

In a somewhat similar manner, it is possible to achieve low-latency relocation of the components from one physical server to another one. The idea revolves around running two instances of the component for some time to reduce latency during serialization/deserialization/state-transfer. The second instance of the component would have its outputs skipped until it has caught up with the first one, and then the first one can be dropped.

Making components deterministic

Ok, I hope that by now I’ve managed to convince you that determinism is a Good Thing™. Now, let’s discuss how to write those deterministic components.

Obvious stuff

First of all, let’s note that there are obvious things to keep in mind when aiming for determinism. In particular:

NB: for the purposes of this article, we won’t aim for cross-platform determinism; while cross-platform determinism is an interesting beast and has its additional benefits, for now we will set it aside to avoid complications. This will save us from quite a few problems, such as iterations over unordered collections and even more nasty different-order-of-floating-point-operations stuff.

Relation to Reactors

Now, let’s note that deterministic components tend to go very well alongside with the Reactor event-driven pattern (as described in [Wikipedia]). This more or less includes such things as Erlang message passing, Node.js, and to certain extent – Akka Actors.

In summary, the Reactor pattern takes incoming inputs (a.k.a. ‘service requests’ or ‘events’) and processes them synchronously, one by one. It also serves as a very convenient point to write all these incoming events into our input log.

So far so good, and if all our Reactor does is calculate a pure function, it stays deterministic just by its very nature. However, as soon as you do as little as call get_current_time() within your component, it ceases to be deterministic . Therefore, we need to discuss ways how to deal with non-determinism.

While the determinism-assuring techniques described below are, strictly speaking, not limited to Reactor and Reactor-like patterns, it is easier to describe some of them in terms of Reactors (and they’re likely to be used in the context of Reactors too).

Considering call output as a program input

As we’ve seen above, even a very simple program which prints current time is not deterministic. However, there is a trick which allows to make it deterministic. More specifically,

If we make current time an input of our program, related non-determinism will go away

In other words, all we need to do to make our program deterministic, is to replace a call to a system-level function get_current_time() with a call to our own function get_our_own_current_time(), where get_our_own_current_time() goes along the following lines:

time_t get_our_own_current_time() {
  switch( deterministic_mode ) {
    case DETERMINISTIC_REPLAY:
      return read_time_t_from_input_log();
    case DETERMINISTIC_RECORDING:
      time_t ret = get_current_time();
        //NON-DETERMINISTIC!
      write_time_t_to_input_log(ret);
        //wrote to input-log
        //to ensure determinism
      return ret;
    case DETERMINISTIC_OFF:
      return get_current_time();
      //NON-DETERMINISTIC!
  }
}

Bingo! We can have our determinism and eat it too!

This trick of declaring something non-deterministic as an input and writing it into input log is pretty much universal in the sense that in theory it can be applied to pretty much any non-deterministic function call including, but not limited to, reading of real random data from /dev/urandom. Even mutexes can be handled this way (though in this case all the data protected by mutex needs to be written to the input log, ouch). On the other hand, in practice there are two significant caveats:

This trick doesn’t go well with fault-tolerant implementations (which need to write all the inputs to a separate physical box)

Guessing game

A different technique to ensure determinism can be described as follows. If we know in advance what will be necessary for the processing of our input event, then we can supply it to our component (logging it to input log in advance, so that it is no longer a problem for fault-tolerant implementations). In particular, as lots of input events need current time, we can say that we’ll provide it to all of them. In this case:

Note that global/per-process singleton won’t be able to ensure determinism if you have more than one thread running your deterministic objects within your process.

Reactor app-level code still needs to call: get_our_own_current_time()instead of: get_current_time()but implementation of get_our_own_current_time() becomes much simpler in this case:

  thread_local time_t event_time;
    //event_time is pre-populated by caller
  time_t get_our_own_current_time() {
    return event_time;
  }

Note that with this implementation, it is not possible to measure execution times within the event handler (as all the calls to get_current_time() for the same event will return the same value). On the other hand, as any kind of execution times measurements would make our program inherently non-deterministic (at the very least when we’re running it on a modern CPU), it is not that big deal. And if you need to make some performance analysis, you still can use something along the lines of Node.js-style console.time()/console.timeEnd(); as these functions do not return the measured time interval value to the program, but rather print it to a log file – then, as long as we do not consider log file as one of program outputs (and the program itself doesn’t read these values from the log file), we’re fine from determinism point of view.

Unfortunately, not all the required inputs can be pre-guessed successfully. However, in quite a few cases, the following technique does help:

This exception-as-a-request-for-non-deterministic-data does help in quite a few scenarios. However, as with anything else, it is not a silver bullet , and chances are that you will need to look for your own ways to ensure determinism.

Conclusions

We’ve discussed the impact of per-component determinism on programming. In particular, we’ve found several major benefits of making your components deterministic (from replay-based regression testing to determinism-based fault tolerance). Also, we’ve discussed some ways of achieving this holy grail of determinism; while techniques discussed here won’t ensure determinism for all programs, they have been seen to allow determinism for quite a few real-world systems (ranging from stock exchanges to online games).

Acknowledgement

Cartoon by Sergey Gordeev from Gordeev Animation Graphics, Prague.

References

[ITHare16] ‘No Bugs’ Hare, ‘Asynchronous Processing for Finite State Machines/Actors: from plain event processing to Futures (with OO and Lambda Call Pyramids in between)’, http://ithare.com/asynchronous-processing-for-finite-state-machines-actors-from-plain-events-to-futures-with-oo-and-lambda-call-pyramids-in-between/

[Wikipedia] Reactor patternhttps://en.wikipedia.org/wiki/Reactor_pattern

Notes: 

More fields may be available via dynamicdata ..