
Title: The Importance of Back-of-Envelope Estimates

Author: Martin Moene

Date: 05 February 2017 15:55:06 +00:00

Summary: Guestimate questions make many people grumble. Sergey Ignatchenko reminds us why they matter.

Disclaimer: as usual, the opinions within this article are those of ‘No Bugs’ Hare, and do not necessarily coincide with the opinions of the translators and Overload editors; also, please keep in mind that translation difficulties from Lapine (like those described in [Loganberry04]) might have prevented an exact translation. In addition, the translator and Overload expressly disclaim all responsibility from any action or inaction resulting from reading this article.

With all the techniques we use during development (ranging from ‘just keyboard and vim’ to ‘RAD IDE which claims to make writing code unnecessary’), there is one thing which is unfortunately meticulously neglected across the software development industry (and moreover, there are arguments pushed forward that it is a Bad Thing even to try using it). I’m speaking about order-of-magnitude estimates made on the back of an envelope. While it is impossible to provide any strict proof that such estimates are useful, we’ll see some anecdotal kinda-evidence that such estimates do help us to avoid spending months (and even years) on development which is apparently unnecessary or outright infeasible.

BTW, the subject of back-of-the-envelope estimates in IT is not something new (it was discussed at least in [McConnell] and [Atwood], and recently mentioned in [Acton]); indeed, I myself am guilty of using it in [NoBugs12] and [NoBugs13] ☺. However, with such estimates being so important and so frequently neglected, I am sure it is necessary to emphasize their importance once again (and again, and again ;-)).

Definition and examples from physics

First of all, let’s define what we’re talking about. Without going into lengthy discussion, let’s just quote Wikipedia:

In the hard sciences, back-of-the-envelope calculation is often associated with physicist Enrico Fermi, who was well known for emphasizing ways that complex scientific equations could be approximated within an order of magnitude using simple calculations.

Besides noting that, with Enrico Fermi being one of the brightest minds of the 20th century, we’re certainly in good company ;-), we should emphasize this ‘within an order of magnitude’ qualifier. In other words, we’re not even trying to get results which could be seen as a replacement for benchmarking. On the other hand, if an order-of-magnitude approximation of the exact value is good enough for our purposes, such simplified calculations save us from having to spend time on an actual experiment.

The two most famous examples of back-of-the-envelope calculations in physics are the following (with lots and lots of others going unrecorded, as they’re not seen as anything special):

Estimating CPU and RAM usage

Well, as most of us are not physicists but rather software developers and/or admins, let’s take a look at some real-world examples from IT. While (of course) any such examples at best count as anecdotal evidence, they still illustrate the tremendous practical value which can come out of such estimates.

My first family of examples is about estimating CPU and RAM usage. Actually, examples of such calculations are so routine that they come and go without being noticed. Still, I managed to remember four recent ones which I encountered within the last two months.

The first example was about CPU. In an otherwise very good and insightful presentation (which I won’t name exactly since it was very good and insightful), an example was provided where a certain suboptimality had led to memory reads of 1 Mbyte per 10,000 frames of the video game – instead of the 1 Kbyte which was really necessary. As was noted (and perfectly correctly too), this corresponds to 99.9% of the memory bandwidth spent on these reads being wasted. What wasn’t mentioned, however, is that:

BTW, let’s note that I am not arguing with the main point of the presentation – that you can easily waste a lot of memory bandwidth; it is just that in this specific example, trying to optimize out a 3e-7 performance hit is very rarely worth the trouble.
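For the sake of illustration, here is a minimal sketch of where a figure like 3e-7 comes from; the 60 frames/second and the ~20 GByte/s memory bandwidth below are my own assumed order-of-magnitude numbers, not ones taken from the presentation.

  #include <cstdio>

  int main() {
      // Assumed numbers (mine, not from the presentation): a 60 fps game and
      // ~20 GByte/second of memory bandwidth; both are order-of-magnitude only.
      const double wasted_bytes_per_10k_frames = 1e6 - 1e3;  // ~1 Mbyte read instead of ~1 Kbyte
      const double frames_per_second = 60.0;
      const double wasted_bytes_per_second =
          wasted_bytes_per_10k_frames / 10000.0 * frames_per_second;  // ~6 Kbyte/s
      const double memory_bandwidth_bytes_per_second = 20e9;          // ~20 GByte/s

      // fraction of the total memory bandwidth lost to this particular suboptimality
      std::printf("wasted fraction ~%.1e\n",
                  wasted_bytes_per_second / memory_bandwidth_bytes_per_second);  // ~3e-7
      return 0;
  }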

The second example was also about CPU. In a high-performance event-driven machine handling TCP/UDP packets coming over Ethernet, the question arose of whether it is OK to use virtual functions to implement event dispatch. A very quick calculation on the back of an envelope showed that:

On the other hand, the picture would be very different if we were talking about virtualizing some operation which is done 100 times per packet (or about handling a 10GBit/s link, but the latter is quite difficult regardless of the virtual function calls). The whole point is about the numbers in specific scenarios – and certainly not about blanket statements such as ‘virtual function costs are {negligible|very high} regardless of the context’.
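To show the kind of arithmetic behind such a conclusion, here is a sketch under assumed numbers of mine; the packet rate, CPU clock and per-call cost are illustrative only (cf. [NoBugs16] for typical operation costs).

  #include <cstdio>

  int main() {
      // Assumed, illustrative numbers
      const double packets_per_second = 100e3;    // ~100K packets/s arriving over Ethernet
      const double cpu_cycles_per_second = 3e9;   // one ~3GHz core
      const double cycles_per_virtual_call = 30;  // order of magnitude for an indirect call

      const double cycles_budget_per_packet = cpu_cycles_per_second / packets_per_second;
      const double one_call_overhead = cycles_per_virtual_call / cycles_budget_per_packet;

      std::printf("budget ~%.0f cycles/packet; one virtual dispatch ~%.2f%% of it\n",
                  cycles_budget_per_packet, one_call_overhead * 100);
      // one virtual dispatch per packet: ~0.1% - negligible;
      // 100 virtualized calls per packet: ~10% - a very different story
      return 0;
  }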

The third example of back-of-the-envelope estimates which I encountered in recent months was about the cost of exceptions vs the cost of the return-and-check of an error code. Modern implementations of exception handling in C++ are ‘zero-cost when no exception happens’ – but they cost thousands of CPU cycles when an exception does happen. On the other hand, an old-fashioned return of an error code from the function, checked by the caller, will cost some single-digit number of CPU cycles on each function call even if nothing goes wrong. When we combine these two numbers, we realize that, performance-wise, exceptions will beat return-and-check as soon as there are 10K or more function calls per exception; on the other hand, having one exception per 100 function calls is probably detrimental to performance. Anything between 100 and 10K function calls per exception is too close to call – but then again, the performance difference probably won’t be too drastic regardless of the approach chosen. NB: I am intentionally ignoring the other advantages of exceptions at the moment; for the purposes of this article, the whole discussion is about performance, and only about performance.
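A minimal sketch of the break-even arithmetic follows, with assumed order-of-magnitude costs (~5000 cycles per thrown exception, ~2 cycles per return-and-check) rather than measured ones.

  #include <cstdio>

  int main() {
      // Assumed order-of-magnitude costs, not measurements
      const double cycles_per_thrown_exception = 5000;  // 'thousands of CPU cycles'
      const double cycles_per_error_check = 2;          // 'single-digit CPU cycles'

      // break-even: N return-and-checks cost as much as one thrown exception
      const double break_even_calls = cycles_per_thrown_exception / cycles_per_error_check;

      std::printf("break-even at ~%.0f function calls per exception\n", break_even_calls);
      // well above this (say 10K calls per exception), exceptions win on performance;
      // well below it (say one exception per 100 calls), return-and-check wins
      return 0;
  }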

The fourth example (and the final one among those which I’ve run into during the last month or two) was about RAM. In the context of app-level caching of USERS from a database, I was asked “Hey, if we try to cache a million users, will they fit in memory?” My very simple analysis went along the following lines:
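A minimal sketch of such an analysis, assuming ~1 Kbyte per cached USER row (including indexes and allocator overheads) – a number which is my own guess and needs to be checked against the actual schema:

  #include <cstdio>

  int main() {
      // Assumed: ~1 Kbyte per cached USER row, including overheads
      const double bytes_per_cached_user = 1024;
      const double users_to_cache = 1e6;

      const double total_bytes = bytes_per_cached_user * users_to_cache;  // ~1 GByte

      std::printf("cache size ~%.1f GByte\n", total_bytes / 1e9);
      // ~1 GByte fits easily into the RAM of a typical server box, so within
      // an order of magnitude the answer is a clear 'yes, it will fit'
      return 0;
  }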

Overall, as we can see, there are quite a few situations where back-of-the-envelope (a.k.a. order-of-magnitude) analysis can provide very valuable information even before we start implementing our system. Sure, such calculations are not a substitute for benchmarking, but they can save us lots and lots of trouble, whether (a) trying to implement things which won’t be able to cope with the load, or (b) indulging in premature optimizations which aim to eliminate problems that will never really bite us.

Estimating MTBF

While estimating CPU and RAM usage is very important, these are certainly not the only cases where calculations on the back of an envelope come in handy.

Quite recently, I was involved in quite an interesting analysis related to making systems redundant. Lots of people out there tend to postulate that ‘Hey, to be reliable your system needs to be redundant, period!’; however, under a more thorough analysis this claim fails pretty badly in quite a significant number of real-world scenarios.

First of all, let’s note that when talking about reliability, it is only MTBF (= Mean Time Between Failures) which matters; in other words,

when comparing two systems – one being redundant but failing every month, and another being non-redundant but failing every year – I will clearly pick the second one every day of the week.

OK, sure, we should note that there are other factors which may need to be considered within the context of reliability (in particular, the way in which the system fails; there is quite a significant difference between losing all the business data, ending up with a ‘split-brain’, or just needing a restart). On the other hand, all the things we’re really interested in are inherently observable ones; redundancy is not an observable property, so it is merely an implementation detail used to achieve those observable MTBFs etc.

Now, let’s consider an OLTP database running on a large 4S/4U box. As noted in [NoBugs16a], such a box (with proper optimizations) can run up to dozens of billions of transactions per year, and this is quite a substantial number for quite a few systems out there.

With 4S/4U boxes having typical MTBFs of 3–5 years, the next question we should ask ourselves is “Hey, will we really be able to write software which crashes much more rarely than that?” If not (and for any business software, the chance of failures being much more frequent than once every 3–5 years is overwhelmingly high), then the whole point of being redundant pretty much goes away; in other words, it can easily be the case that spending time making the system redundant is not efficient, and that spending the same time on things such as code reviews will lead to better overall reliability.

On MTBFs of redundant systems

As a side note – when calculating the MTBF of a redundant system with 2 nodes, the numbers are actually not as good as they might seem at first glance. Let’s consider a redundant system Z consisting of 2 redundant components X and Y. Now we need to introduce MDT (= Mean Down Time), the mean time from a node going down until it is fully back in service; MDT is usually measured in hours and usually ranges from 8 hours to several days. (Note that MTTR (= Mean Time To Repair), while closely related to MDT, doesn’t normally include such things as the time needed to get the replacement part to your datacenter – and so is not directly usable for our MTBF analysis.)

Let’s note that the maths below, while perfectly common and accepted (see, for example, Chapter 8 in [Smith]), relies on quite a few implicit assumptions which are beyond the scope of our exercise. In particular, it assumes (a) that MDTs are negligible compared to MTBFs, and (b) that failure probabilities (inverses of MTBFs) can simply be added together (i.e. that failure probabilities are small enough for the non-linear effects of adding probabilities to be negligible).

Note that all these assumptions, while potentially confusing, do stand in most real-world situations. What we’ll be concentrating on is a different implicit assumption – the one which doesn’t usually stand ☹.

//WARNING: INVALID IMPLICIT ASSUMPTION AHEAD

At this point, it is common to say (erroneously! See below) that redundant system Z will fail if and only if one of the following scenarios happens: (a) after component X fails, component Y fails within MDTx (i.e. while component X is still being repaired); or (b) after component Y fails, component X fails within MDTy (i.e. while component Y is still being repaired). The probability of such a failure of component Y within MDTx, assuming that MTBFs are large and MDTs are relatively small compared to MTBFs, is

Pyx = 1/MTBFy * MDTx

NB: relying on assumption (a) above

It means that MTBFz can be calculated as

MTBFzincorrect
= 1 / ( 1/MTBFx * Pyx + 1/MTBFy * Pxy )
= 1 / ( 1/MTBFx * 1/MTBFy * MDTx + 1/MTBFy * 1/MTBFx * MDTy )
= MTBFx * MTBFy / ( MDTx + MDTy )

(here Pxy = 1/MTBFx * MDTy is, symmetrically, the probability of component X failing within MDTy)

NB: relying on assumption (b) above

//END OF INVALID IMPLICIT ASSUMPTION

It looks all grand and dandy (and with typical MTBFs of 3–5 years and MDTs being at most 3–5 days, we’d have an MTBFzincorrect of thousands of years – wow!) – until we notice that there is one thing which is utterly unaccounted for in the formula above: the MTBF of the redundancy system itself. Let’s name it MTBFr.

Practically, MTBFr needs to cover all the components which form the redundancy system itself. Just one example: if our system uses a ‘heartbeat’ Ethernet cable between two nodes to detect the failure of the other node, then a failure of this cable is likely to lead to all kinds of trouble (including the extremely disastrous ‘split-brain’ failures), and so it needs to be accounted for in MTBFr. In a similar manner, the network cards (and their respective drivers(!)) serving this ‘heartbeat’ cable also need to be included in MTBFr. Moreover, if this cable and the NICs are made redundant (which would be quite unusual, but is certainly doable), they will still have their respective MTBFs, and there will be some kind of software (or, Linus forbid, drivers) handling this redundancy, which will also contribute to MTBFr. And so on, and so forth.

With MTBFr in mind (and realizing that whenever the redundancy system itself fails, the whole thing fails too), MTBFzcorrect can be written as

MTBFzcorrect = 1 / (1/ MTBFzincorrect + 1/ MTBFr). (*)

How large your MTBFr is depends on the specifics of your system, but I can assure you that for the vast majority of real-world cases, it will be much smaller than those hyper-optimistic ‘thousands of years’.

Moreover, according to formula (*) above, if MTBFr is smaller than MTBFx and MTBFy, adding redundancy actually makes things worse than they were without any redundancy.

And in practice (especially whenever redundancy is implemented in software), MTBFr can easily be much smaller than MTBFx. For example, if MTBFr is 1 month (BTW, this is not a joke – I’ve seen quite a few redundancy systems which exhibited less-than-a-week MTBFs under serious load) while MTBFx is 3–5 years, formula (*) shows that MTBFzcorrect is about 0.9999 of MTBFr (i.e. much smaller than the original non-redundant MTBFx).
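For those who prefer to see the arithmetic spelled out, here is a sketch which plugs illustrative values (MTBFx = MTBFy = 4 years, MDTx = MDTy = 3 days, MTBFr = 1 month) into both formulae above; the specific values are examples only.

  #include <cstdio>

  int main() {
      // Illustrative values only, all expressed in days
      const double mtbf_x = 4 * 365.0, mtbf_y = 4 * 365.0;  // ~4 years per node
      const double mdt_x = 3.0, mdt_y = 3.0;                // ~3 days of downtime per failure
      const double mtbf_r = 30.0;                           // redundancy system itself

      // MTBFzincorrect = MTBFx * MTBFy / (MDTx + MDTy)
      const double mtbf_z_incorrect = mtbf_x * mtbf_y / (mdt_x + mdt_y);
      // MTBFzcorrect = 1 / (1/MTBFzincorrect + 1/MTBFr)   -- formula (*)
      const double mtbf_z_correct = 1.0 / (1.0 / mtbf_z_incorrect + 1.0 / mtbf_r);

      std::printf("incorrect: ~%.0f years; correct: ~%.1f days (%.4f of MTBFr)\n",
                  mtbf_z_incorrect / 365.0, mtbf_z_correct, mtbf_z_correct / mtbf_r);
      // prints roughly: incorrect: ~973 years; correct: ~30.0 days (0.9999 of MTBFr)
      return 0;
  }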

Redundancy estimates – sanity checks

As a nice side effect, our formula for MTBFzcorrect also explains the apparent discrepancy between theoretically predicted MTBFs (those calculated in a manner similar to the calculation of MTBFzincorrect) and realistic numbers – especially if redundancy is implemented in software. Just two real-world stories in this regard.

Once upon a time (around 1997 or so), Stratus (one of the leaders in really serious redundant systems aimed at the military, nuclear power stations, etc. – and generally performing as promised) decided to produce a software-based RADIO cluster. Essentially, the RADIO cluster was just two Windows boxes with fault-tolerant logic implemented in software; using MTBFzincorrect-like calculations, its calculated MTBF was in the hundreds of years. It was a rather impressive system (well, all the Stratuses are quite impressive) – that is, until RADIO crashed with a dreaded Blue Screen Of Death (coming from the redundancy driver) just half an hour into testing ☹; and it wasn’t a fluke: under any kind of decent load, RADIO kept crashing several times a day. So much for the very-nice-looking MTBFs calculated along the lines of MTBFzincorrect. RADIO was discontinued really soon, and to the best of my knowledge, Stratus gave up on software-based redundancy, at least for a long while. BTW, for hardware-based Stratuses the MTBF is indeed in the hundreds of years.

In a completely separate story, at some point a rather seriously loaded system was undergoing a pre-IPO technical audit. Out of the whole audit, two points are of interest to our discussion:

At that point, the team wasn’t able to articulate the formulae and back-of-the-envelope estimates discussed above to explain the whole thing; however, as of now, feel free to use them as an argument with all kinds of auditors (and management) insisting on redundancy for all the mission-critical boxes. Note, though, that this logic stands only if both of the following observations stand for your system:

Estimating network traffic and RTT

Last but not least, back-of-the-envelope estimates often come in handy in the context of network programming.

One rather typical case for traffic calculations occurs when we want to push small pieces of information to the clients more or less continuously; this happens all the time in the context of a game (and IMNSHO, games include stock exchanges, etc.).

Let’s consider an example where we need to calculate traffic for an RPG game server which handles 10K players, with each of the players receiving updates about 10 other players and 50 objects (those which are nearby, so the player can see them), with each of the players and objects described by three coordinates and a quaternion (the latter to describe rotation), and with the updates sent 60 times per second (at the frame rate of the game client). In this case, if we’re encoding each coordinate and each component of the quaternion as an 8-byte double, we’ll find that:

On the other hand, if we:

we can easily reduce our traffic to:

While this is still quite a substantial number, it is roughly 36x smaller than before, so at least our traffic costs went down quite a bit ☺. Moreover, while it is very difficult to account for this in advance, rounded fixed-point numbers usually compress significantly better than full-scale doubles, which will usually allow a further reduction in traffic.
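As a rough illustration of where such numbers come from, here is a sketch of the naive calculation from the parameters above, with the ~36x reduction then applied as a single divisor (the exact savings depend on the encoding and update rate chosen).

  #include <cstdio>

  int main() {
      // Parameters from the example above
      const double players = 10e3;
      const double entities_per_player = 10 + 50;  // 10 nearby players + 50 objects
      const double values_per_entity = 3 + 4;      // 3 coordinates + 4 quaternion components
      const double bytes_per_value = 8;            // encoded as 8-byte doubles
      const double updates_per_second = 60;        // at client frame rate

      const double naive_bytes_per_second = players * entities_per_player *
          values_per_entity * bytes_per_value * updates_per_second;  // ~2 GByte/s

      std::printf("naive: ~%.1f GByte/s (~%.0f GBit/s); reduced ~36x: ~%.0f MByte/s\n",
                  naive_bytes_per_second / 1e9, naive_bytes_per_second * 8 / 1e9,
                  naive_bytes_per_second / 36 / 1e6);
      return 0;
  }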

Another case for back-of-the-envelope estimates in the context of network programming is related to round-trip times (routinely abbreviated as RTT). An example which I want to describe in this regard is a rather long real-world story.

Once upon a time, there was a Really Big Company. And the developers there were tasked with developing a system so that companies around the globe (in particular, from the US and Europe) could collaborate. And the guys were required to do it over CORBA, as a Big Fat Business Requirement (please don’t tell me that CORBA is actually an implementation detail so it doesn’t belong in Business Requirements; of course it is, but this kind of stuff does happen, especially in Really Big Companies).

And the CORBA of that time (which didn’t support the concept of passing objects by value rather than by reference) had an ideology in which you create a remote object, then call a remote method to add an item to a certain list in the remote object, then call a remote method to set a field in the item of the remote object you just created, and so on. And for several of the forms within the program – well, the number of fields combined over the various items reached thousands(!); with CORBA, this meant that when a user pressed the ‘submit’ button, there would be thousands of those remote calls.

The guys from the Really Big Company built the system along the lines above, and they even tested it on a LAN. However, when they tried to deploy it across the Atlantic (which was their primary use case), submitting certain forms started to take up to 20 minutes(!), which was obviously unacceptable. Sure, the system was eventually fixed (by removing those round trips, which required rewriting about half of the system and took half a year or so).

What matters, from our current perspective, is that if the guys had performed a back-of-the-envelope estimate in advance, they would have seen that:

Sure, in practice it was even worse than that – but even 15 minutes would have been bad enough ☹ to see that the whole model was not really feasible. Realizing this in advance would have saved those guys (and that Really Big Company) quite a bit of wasted effort – and quite a bit of embarrassment too.
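Here is a sketch of the estimate that could have been made in advance; the ~100 ms transatlantic RTT and the 5000-call count are my own assumptions, chosen only to show the order of magnitude.

  #include <cstdio>

  int main() {
      // Assumed numbers; the real values depend on the form and the route
      const double transatlantic_rtt_seconds = 0.1;  // ~100 ms US <-> Europe
      const double remote_calls_per_submit = 5000;   // thousands of fields => thousands of calls

      // with each CORBA call being a synchronous round trip, latencies simply add up
      const double seconds_per_submit = transatlantic_rtt_seconds * remote_calls_per_submit;

      std::printf("'submit' needs ~%.0f seconds (~%.1f minutes) of pure round trips\n",
                  seconds_per_submit, seconds_per_submit / 60);
      // ~8 minutes before any server-side processing - the same order of magnitude
      // as the 15-20 minutes observed in practice, and clearly infeasible
      return 0;
  }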

The importance of sanity checks

I am terribly sorry, but here your calculations are wrong by EIGHT orders of magnitude, so I cannot give you credit for this lab.
~ overheard during my uni physics courses

One all-important precaution which needs to be taken whenever you’re trying to use back-of-the-envelope estimates (or actually, any kind of calculation, but that is a subject for a separate discussion) is to make sanity checks. When you have your estimate, you at least need to think about whether it makes sense given your experience; moreover, if you’re going to use the estimate to make any serious decisions, you should try to find some other way to double-check it. BTW, in quite a few real-world scenarios, a back-of-the-envelope estimate can be double-checked by another – independent(!) – back-of-the-envelope estimate.

Benchmarks vs back-of-the-envelope estimates

When trying to convince various people to do some back-of-the-envelope estimates, I’ve often run into arguments such as ‘Hey, don’t even try this, the only way to get anything meaningful is to try it and benchmark it properly in the context of your specific task’.

I’m not going to argue with this statement; it is indeed good advice – that is, if trying and benchmarking is feasible. However, especially at the earlier stages of development, trying certain things can be extremely expensive; this is when back-of-the-envelope estimates come in really handy.

Back-of-the-Envelope Estimates are not a replacement for benchmarks; instead, they’re prerequisites to implementing systems which can be used for benchmarking.

Conclusions

Summarizing all the anecdotal kinda-evidence above, I am comfortable stating the following.

On the one hand, back-of-the-envelope calculations are all-important, at least at the architectural stages of a project. In particular, they often allow us to avoid implementing things which cannot possibly fly for various reasons – and, conversely, to avoid premature optimizations which will never be important enough for the task at hand.

On the other hand, back-of-the-envelope calculations are just one of the instruments available, and, unfortunately, they’re not the most reliable one ☹.

Make sure to double-check back-of-the-envelope estimates, and to test them, and to benchmark things – as long as it is feasible, that is.

References

[Acton] Mike Acton, ‘Data-Oriented Design and C++’, CppCon 2014, https://www.youtube.com/watch?v=rX0ItVEVjHc

[Atwood] Jeff Atwood, ‘How Good an Estimator Are You?’, https://blog.codinghorror.com/how-good-an-estimator-are-you/

[Loganberry04] David ‘Loganberry’, Frithaes! – an Introduction to Colloquial Lapine!, http://bitsnbobstones.watershipdown.org/lapine/overview.html

[McConnell] Steve McConnell, Software Estimation: Demystifying the Black Art (Developer Best Practices), https://www.amazon.com/exec/obidos/ASIN/0735605351/codihorr-20

[NoBugs12] ‘No Bugs’ Hare, ‘640K 2^256 Bytes of Memory is More than Anyone Would Ever Need Get’, Overload #112, 2012

[NoBugs13] ‘No Bugs’ Hare, ‘Hard Upper Limit on Memory Latency’, Overload #116, 2013

[NoBugs16] ‘No Bugs’ Hare, ‘Infographics: Operation Costs in CPU Clock Cycles’, http://ithare.com/infographics-operation-costs-in-cpu-clock-cycles/

[NoBugs16a] ‘No Bugs’ Hare, ‘Gradual OLTP DB Development – from Zero to 10 Billion Transactions per Year and Beyond’, http://ithare.com/gradual-oltp-db-development-from-zero-to-10-billion-transactions-per-year-and-beyond/

[Smith] David J. Smith, Reliability, Maintainability and Risk, ISBN 978-0080969022

Acknowledgement

Cartoon by Sergey Gordeev from Gordeev Animation Graphics, Prague.
