Journal Articles
Browse in : |
All
> Journals
> Overload
> o137
(7)
All > Topics > Programming (877) All > Topics > Process (83) Any of these categories - All of these categories |
Note: when you create a new publication type, the articles module will automatically use the templates user-display-[publicationtype].xt and user-summary-[publicationtype].xt. If those templates do not exist when you try to preview or display a new article, you'll get this warning :-) Please place your own templates in themes/yourtheme/modules/articles . The templates will get the extension .xt there.
Title: The Importance of Back-of-Envelope Estimates
Author: Martin Moene
Date: 05 February 2017 15:55:06 +00:00 or Sun, 05 February 2017 15:55:06 +00:00
Summary: Guestimate questions make many people grumble. Sergey Ignatchenko reminds us why they matter.
Body:
Disclaimer: as usual, the opinions within this article are those of ‘No Bugs’ Hare, and do not necessarily coincide with the opinions of the translators and Overload editors; also, please keep in mind that translation difficulties from Lapine (like those described in [Loganberry04]) might have prevented an exact translation. In addition, the translator and Overload expressly disclaim all responsibility from any action or inaction resulting from reading this article.
With all the techniques we use during development (ranging from ‘just keyboard and vim’ to ‘RAD IDE which claims to make writing code unnecessary’), there is one thing which is unfortunately meticulously neglected across the software development industry (and moreover, there are arguments pushed forward that it is a Bad Thing even to try using it). I’m speaking about order-of-magnitude estimates made on the back of an envelope. While it is impossible to provide any strict proof that such estimates are useful, we’ll see some anecdotal kinda-evidence that such estimates do help us to avoid spending months (and even years) on development which is apparently unnecessary or outright infeasible.
BTW, the subject of back-of-the-envelope estimates in IT is not something new (it was discussed at least in [McConnell] and [Atwood], and recently mentioned in [Acton]); indeed, I myself am guilty of using it in [NoBugs12] and [NoBugs13] ☺. However, with such estimates being so important and so frequently neglected, I am sure it is necessary to emphasize their importance once again (and again, and again ;-)).
Definition and examples from physics
First of all, let’s define what we’re talking about. Without going into lengthy discussion, let’s just quote Wikipedia:
In the hard sciences, back-of-the-envelope calculation is often associated with physicist Enrico Fermi, who was well known for emphasizing ways that complex scientific equations could be approximated within an order of magnitude using simple calculations.
Besides noting that with Enrico Fermi being one of the brightest minds of the XX century, we’re certainly in a good company ;-), we should emphasize this ‘within an order of magnitude’ qualifier. In other words, we’re not even trying to get results which can be seen as a replacement for benchmarking. On the other hand, if our simplified calculations can give us an order of magnitude that is appropriate for approximating the exact value we need, we don’t have to spend time performing an actual experiment.
Two most famous examples of back-of-the-envelope calculations in physics are the following (with lots and lots of others not recorded as they’re not seen as anything special):
- Fermi estimating the yield of an atomic bomb by dropping bits of paper and measuring the distance they were blown by the blast wave (he estimated 10K tons of TNT, the actual result was 18.6K tons). See also his famous ‘Fermi problem’ of ‘how many piano tuners are there in Chicago?’ ☺.
- Arnold Wilkins spending a few hours proving that radio-based ‘death rays’ (claimed to be invented by several influential people, including Guglielmo Marconi and Nicola Tesla) are outright impossible, but using radio waves to detect moving objects is perfectly feasible. This, in turn, led to the development of the radar.
Estimating CPU and RAM usage
Well, as most of us are not physicists, and are more like software developers and/or admins, let’s take a look at some real-world examples from IT. While (of course), any such examples at best count as anecdotal evidence, they still illustrate the tremendous practical value which can come out of them.
My first family of examples are about estimating CPU and RAM usage. Actually, examples of such calculations are so routine that they come and go without being noticed. Still, I managed to remember four recent ones which I encountered within the last two months.
The first example was about CPU. In an otherwise very good and insightful presentation (which I won’t name exactly since it was very good and insightful), an example was provided where a certain suboptimality had led to memory reads of 1Megabyte per 10,000 frames of the video game – instead of the 1 Kbyte which was really necessary. As it was noted (and perfectly correctly too), this corresponds to 99.9% of CPU memory bandwidth being wasted. What wasn’t mentioned, however, is that:
- 1Megabyte of waste per 10,000 frames corresponds to a mere 100 bytes/frame
- at a typical 60 frames/second, this corresponds to a mere 6K bytes/second of memory bandwidth being wasted
- with modern x64 CPU bandwidth being of the order of 20+ GByte/second, we’re talking about the waste of about 3e-7 of the total memory bandwidth.
BTW, let’s note that I am not arguing with the main point of the presentation – that you can easily waste a lot of memory bandwidth; it is just in this specific example, trying to optimize out a 3e-7 performance hit is very rarely worth the trouble.
The second example was also about CPU. In a high-performance event-driven machine handling TCP/UDP packets coming over the Ethernet, the question has arisen whether it is ok to use virtual functions to implement event dispatch. A very quick calculation on the back of the envelope has shown that:
- The minimal size of an Ethernet packet which such a system can possibly receive is Ethernet_Frame_Gap + Ethernet_Preamble + Minimal_Ethernet_Packet_Size, or 12+8+64 bytes = 84 bytes.
- For a 100MBit/s link (which is the limit the system has for other reasons), in the very, very best case, it can get 100Mbit/8 ~= 1.2e7 bytes/second, or (given the 84 bytes/packet) 150K packets/second.
- As it is noted in [NoBugs16], the cost of a virtual function call is usually within 50 CPU cycles
- On the other hand, at 3GHz and with 150K packets/second, we’re talking about 20,000 CPU cycles available for handling each packet.
- The waste of at most 50 CPU cycles – compared to 20K CPU cycles – is about 0.2%. Compared to the trouble of avoiding virtual dispatch, the saving seems too small to bother with.
On the other hand, the picture would be very different if we were talking about virtualizing some operation which is done 100 times per packet (or about handling the 10GBit/s link, but the latter is quite difficult regardless of the virtual function calls). The whole point is about the numbers in specific scenarios – and certainly not about blanket statements such as ‘virtual functions costs are {negligible|very high} regardless of the context’.
The third example of back-of-the-envelope estimates which I encountered in recent months was about the cost of exceptions vs the cost of the return-and-check of the error code. The cost of modern implementations of exception handling in C++ are ‘zero-cost when no exception happens’ – but they cost thousands of CPU cycles when they do happen. On the other hand, an old-fashioned return of the error code from the function being checked by the caller will cost some single-digit CPU cycles on each function call even if nothing wrong happens. When we combine these two numbers together, we realize that performance-wise exceptions will beat return-and-checks as soon as there is 10K function calls per one exception; on the other hand, having one exception per 100 function calls is probably detrimental to performance. Anything between 100 and 10K function calls is too close to call – but on the other hand, the performance difference probably won’t be too drastic regardless of the approach chosen. NB: I am intentionally ignoring the other advantages of exceptions at the moment; for the purposes of this article, the whole discussion is about performance, and only about performance.
The fourth example (and final among those which I’ve run into during the last month or two) was about RAM. In the context of app-level caching of USERS from a database, I was asked “Hey, if we try to cache a million users, will they fit in memory?†My very simple analysis went along the following lines:
- If we are talking about a ‘user’ which is just a current state of the user (and not the history of those things which user has ever done), we’re usually talking about user id, name, hashed password, e-mail address, maybe physical address, and some attributes such as current credit. If we add all these items together, we’ll be talking about some hundreds of bytes. Accounting for all the inefficiencies and overheads, let’s say the upper bound is around 1K bytes/user.
- With this in mind, a million users will take around 1 GByte.
- As this whole discussion was in the context of servers, and as modern ‘workhorse’ 2S/1U servers are able to host up to 512G RAM easily (and modern 4S/4U servers able to host 4+ terabytes1), it means that even 100M users aren’t likely to cause too much trouble. In addition, noting that this is not really the number of all users in the DB, but just those users which need to be cached (which roughly corresponds to ‘users currently on site’) – well, it means that there aren’t that many systems out there (if any) which would have any trouble caching users in the app-level cache.
Overall, as we can see, there are quite a few situations when back-of-the-envelope (a.k.a. order of magnitude) analysis can provide very valuable information even before we start implementing our system. Sure, such calculations are not a substitute for benchmarking, but they can save us lots and lots of trouble either (a) trying to implement things which won’t be able to cope with the load, or (b) trying to do premature optimizations which aim to eliminate problems which will never really bite us.
Estimating MTBF
While estimating CPU and RAM usage is very important, they’re certainly not the only cases when calculations on the back of the envelope come in handy.
Quite recently, I was involved in quite an interesting analysis related to making systems redundant. Lots of people out there tend to postulate that ‘Hey, to be reliable your system needs to be redundant, period!’; however, under a more thorough analysis this claim fails pretty badly in quite a significant number of real-world scenarios.
First of all, let’s note that when talking about reliability, it is only MTBF (= Mean Time Between Failures) which matters; in other words,
if comparing two systems – one being redundant but failing every month, and another being non-redundant but failing every year – I will clearly pick the second one every day of the week.2
Ok, sure we should note that there are other factors which may need to be considered within the context of reliability (in particular, the way in which the system fails; there is quite a significant difference between losing all the business data, having a ‘split-brain’, or just needing a restart). On the other hand, all these things we’re really interested in are inherently observable ones; as redundancy is not an observable property, it is merely an implementation detail to implement those observable MTBFs etc.
Now, let’s consider an OLTP database running on a larger 4S/4U box; as noted in [NoBugs16a]. Such a box (with proper optimizations) can be able to run up to dozens of billions transactions/year, and this is quite a substantial number for quite a few systems out there.
With 4S/4U boxes having typical MTBFs of 3–5 years, the next question we should ask ourselves, is “Hey, will we really be able to write software which crashes much more rarely than that?†If not (and for any business software, the chance of failures being much more frequent than once every 3–5 years is overwhelmingly high), then the whole point of being redundant pretty much goes away; in other words, it can easily be the case that spending time making the system redundant is not efficient, and that spending the same time on things such as code reviews can lead to better overall reliability.
On MTBFs of redundant systems
As a side note – actually, when calculating MTBFs of a redundant system with 2 nodes, the numbers are not as good as they might seem on first glance. Let’s consider a redundant system Z consisting of 2 redundant components X and Y. Now we need to introduce MDT (= Mean Down Time), which is the mean time between the node going down; MDT is usually measured in hours and usually ranges from 8 hours to several days. (Note that MTTR (= Mean Time To Repair), while closely related to MDT, doesn’t normally include such things as ‘How to get the replacement part to your datacenter’ – and so is not directly usable for our MTBF analysis.)
Let’s note that the maths below, while perfectly common and accepted (see, for example, Chapter 8 in [Smith]) is using quite a few implicit assumptions, which are beyond the scope of our exercise. In particular, it assumes that (a) MDTs are negligible compared to MTBFs, and (b) that failure probabilities (inverse of MTBFs) can be added together (i.e. that failure probabilities are small enough to say that non-linear effects when adding probabilities, are negligible).
Note that all these assumptions, while potentially confusing, do stand in most real-world situations. What we’ll be concentrating on is a different implicit assumption – the one which doesn’t usually stand ☹.
//WARNING: INVALID IMPLICIT ASSUMPTION AHEAD
At this point, it is common to say (erroneously! See below) that redundant system Z will fail if and only if one of the following scenarios happen: (a) after component X fails, component Y will fail within MDTx (i.e. while component X is still being repaired); or (b) after component Y fails, component X will fail within MDTy (i.e. while component Y is still being repaired). The probability of such a failure of component Y within the MDTx, assuming that MTBFs are large, and MDTs are relatively small compared to MTBFs, is
Pyx = 1/MTBFy * MDTx
NB: relying on assumption (a) above
It means that MTBFz can be calculated as
MTBFzincorrect
= 1 / ( 1 / MTBFx * Pyx + 1 / MTBFy * Pyx )
= 1 / ( 1 / MTBFx * 1/MTBFy * MDTx + 1 / MTBFy * 1/MTBFx * MDTy )
= MTBFx * MTBFy / (MDTx + MDTy)
NB: relying on assumption (b) above
//END OF INVALID IMPLICIT ASSUMPTION
It looks all grand and dandy (and with typical MTBFs of 3–5 years and MDTs being maximum 3–5 days, we’d have an MTBFzincorrect of thousands of years – wow!) – until we notice that there is one thing which is utterly unaccounted for in the formula above: it is the MTBF of the redundancy system itself. Let’s name it MTBFr.
Practically, MTBFr needs to cover all the components which form the redundancy system itself. Just one example: if our system uses a ‘heartbeat’ Ethernet cable between two nodes to detect failure of the other node, then failure of this cable is likely to lead to all kinds of trouble (including extremely disastrous ‘split-brain’ failures), and so it needs to be accounted for in MTBFr. In a similar manner, network cards (and their respective drivers(!)) serving this ‘heartbeat’ cable, also need to be included into MTBFr. Moreover, if this cable and NICs are made redundant (which would be quite unusual, but is certainly doable), they will still have their respective MTBFr, and moreover there will be some kind of software (or, Linus forbid, drivers) handling this redundancy, which will also have its own MTBFr. And so on, and so forth.
With MTBFr in mind (and realizing that whenever redundancy system itself fails – the whole thing will fail too) – MTBFzcorrect can be written as
MTBFzcorrect = 1 / (1/ MTBFzincorrect + 1/ MTBFr). (*)
How large your MTBFr is depends, but I can assure you that for the vast majority of real-world cases, it will be much smaller than those hyper-optimistic ‘thousands of years’.
Moreover, according to the formula (*) above, if MTBFr is smaller than MTBFx and MTBFy, adding redundancy makes things worse than it was without any redundancy.
And in practice (and especially whenever redundancy is implemented in software), MTBFr can be easily much smaller than MTBFx. For example, if MTBFr is 1 month (BTW, it is not a joke, I’ve seen quite a few redundancy systems which exhibited less-than-a-week MTBFs under serious loads) while having MTBFx at 3–5 years – the formula (*) will show that MTBFzcorrect is about 0.9999 of MTBFr (i.e. much smaller than original non-redundant MTBFx).
Redundancy estimates – sanity checks
As a nice side effect, our formula of MTBFzcorrect also explains an apparent discrepancy of theoretically predicted MTBFs (those calculated in a manner similar to the calculation of MTBFzincorrect), and realistic numbers – especially if redundancy is implemented in software. Just two real-world stories in this regard.
Once upon a time (around 1997 or so), Stratus (one of the leaders of really serious redundant systems aimed at military, nuclear stations etc. – and generally performing as promised) decided to produce a software-based RADIO cluster. Essentially, the RADIO cluster was just two Windows boxes with a fault-tolerant logic implemented in software; using MTBFzincorrect-like calculations, its MTBF was in hundreds of years. It was a rather impressive system (well, all the Stratuses are quite impressive) – that is, until RADIO crashed with a dreaded Blue Screen Of Death (coming from redundancy driver) just half an hour into testing ☹; and it wasn’t a fluke: under any kind of decent load, RADIO kept crashing several times a day. So much for very-nicely-looking MTBFs calculated along the lines of MTBFzincorrect. RADIO was discontinued really soon, and to the best of my knowledge, Stratus has given up on software-based redundancy at least for a long while. BTW, for hardware-based Stratuses MTBF is indeed in hundreds of years.
In a completely separate story, at some point a rather seriously loaded system was undergoing pre-IPO technical audit. Out of the whole audit, two points are of interest to our discussion:
- At some point, the auditor asked about system downtime, and after hearing the answer, he was really surprised; the wording went along the lines of ‘how do you guys manage to achieve unplanned downtimes which are 5x–10x lower than others in the industry?!’ Ok, the full log files were provided, essentially proving that the downtimes-being-much-lower-than-industry-average were real.
- At another point (coming much later in the discussion), the auditor noticed that main DB server of the system runs without redundancy. From this point on, the dialog went along the following lines:
- Auditor: Why don’t you use clusters?
- Team: Why should we?
- Auditor: Because it isn’t reliable without redundancy.
- Team: Actually, the reason why we have unplanned downtimes which are that much lower than industry average is because we do NOT use clusters – and the rest of the industry does.
- The curtain falls.
At that point, the team wasn’t able to articulate the formulae and back-of-the-envelope estimates discussed above to explain the whole thing; however, as of now, feel free to use it as an argument with all kinds of the auditors (and management) insisting on redundancy of all the mission-critical boxes. Note, though, that this logic stands only if both of the following two observations stand for your system:
- The cost of failure can be calculated; in other words, it is not about loss of life, and even not about loss of the whole business.
- the number of boxes involved is very low (like in ‘just one, maybe two’). As the formulae above will show, the cost of redundancy (even if it has an MTBFr as poor as 1 failure per month) is low enough compared to the chance of any one of a thousand of servers failing.
Estimating network traffic and RTT
Last but not least, back-of-the-envelope estimates often come handy in the context of network programming.
One rather typical case for traffic calculations occurs when we want to push small pieces of information to the clients all the time; this happens all the time in the context of a game (and IMNSHO, games include stock exchanges, etc.).
Let’s consider an example where we need to calculate traffic for an RPG game server which handles 10K players, with each of the players receiving updates about 10 other players and 50 objects (those which are nearby, so the player can see them), with each of players and objects described with three coordinates and a quaternion (the latter to describe rotation), and the updates are sent 60 times per second (at the frame rate of the game client). In this case, if we’re encoding each coordinate and each angle as an 8-byte double, we’ll find that:
- Each update will take ((10+50)*7*8) bytes of payload, plus at least 14+28 bytes of Ethernet + IP + UDP headers, totalling to ~3K per update
- Each player will need to receive 3Kbytes / update * 60 updates/second = 180 kBytes/second
- With 10K players/server, it means that per-server traffic will be like 1,8 GByte/second, or 14.5 GBit/second
- Even with typical pricing these days being of the order of $3K / month for 10 Gbit/s, it is going to cost quite a lot.
On the other hand, if we:
- reduce the update frequency (for most RPG out there, 20 updates / second will be perfectly unnoticeable, and 10 updates / second will still probably be fine)
- switch to rounded fixed-point representation for coordinates (which can be as little as 10–12 bits per coordinate; 10 bits is enough to represent 10 cm precision in a 100m radius around our character). It will mean that each coordinate will use 10 bits instead of former 64(!).
- switch to Euler-angle-based fixed-point representation for rotations (with precision of each angle being 10 bits, or 0.3 degree)
we can easily reduce our traffic to:
- ((10+50)*6*1.25)+42 ~= 500 bytes / update
- 500 bytes / update * 10 updates / second = 5 kBytes / second per player
- 5kBytes/second/player * 10K players/server = 50 Mbytes / second, or 400 Mbit/second
While it is still quite a substantial number, it is roughly 36x smaller than before, so at least our traffic costs went down quite a bit ☺. Moreover, while this is very difficult to account for in advance, rounded fixed-point numbers are usually compressed significantly better than full-scaled doubles, which will usually allow further reduction of the traffic.
Another case for back-of-the-envelope estimates in the context of network programming is related to round-trip times (routinely abbreviated as RTT). An example which I want to describe in this regard is a rather long real-world story.
Once upon a time, there was a Really Big Company. And developers there were tasked with developing a system so that companies around the globe (in particular, from the US and Europe) can collaborate. And the guys were tasked with doing it over CORBA as a Big Fat Business Requirement (please don’t tell me that CORBA is actually an implementation detail so it doesn’t belong in Business Requirements; of course it is, but this kind of stuff does happen, especially in Really Big Companies).
And CORBA of that time (which didn’t support the concept of passing objects by value rather than by reference) had an ideology that you create a remote object, and then you call a remote method to add an item to a certain list in the remote object, and then you call a remote method to set a field in the item of the remote object you just created, and so on. And for several of the forms within the program – well, the number of fields combined over various items has reached thousands(!); with CORBA, this meant that when a user presses the ‘submit’ button, there will be thousands of those remote calls.
The guys from the Really Big Company have built the system along the lines above, and they even tested it in LAN. However, when they tried to deploy it over the Atlantic (which was their primary use case), submitting certain forms started to take up to 20 minutes(!), which was obviously unacceptable. Sure, the system was fixed (by removing those roundtrips, which required rewriting like a half of the system and took half a year or so).
What matters, from our current perspective, is that if the guys had performed a back-of-the-envelope estimate in advance, they would see that:
- 5,000 fields for all the items in a form, combined with CORBA’s ‘1 remote call per field’ approach, means 5,000 remote calls
- Each remote call incurs at least one round-trip
- Over the trans-Atlantic cable, each remote call takes at the very least 80 ms (the absolute theoretical limit imposed by the speed of light in fibre is 56ms for NY-to-London); taking into account that the program was intended to be used worldwide, we should expect a round-trip time of 200ms at the very least.
- Hence, in the worst-case, 5,000 remote calls would lead to 5,000 round-trips of 200ms each, which translates into a 1000-second delay, or over 15 minutes.
Sure, in practice it was even worse than that – but even 15 minutes would be bad enough ☹ to see that the whole model is not really feasible. Realizing this in advance would save those guys (and that Really Big Company) quite a bit of wasted effort – and quite a bit of embarrassment too.
The importance of sanity checks
I am terribly sorry, but here your calculations are wrong by EIGHT orders of magnitude, so I cannot give you a credit for this lab.
~ overheard during my uni physics courses
One all-important precaution which needs to be done whenever you’re trying to use back-of-the-envelope estimates (or actually, any kind of calculations, but this is a subject for a separate discussion) is making sanity checks. When you have your estimate, you at least need to try to think whether it makes sense given your experience; however, if you’re going to use the estimate to make any serious decisions, you should try to get some other way to double-check your estimate. BTW, in quite a few real-world scenarios, a back-of-the-envelope estimate can be double-checked by another – independent(!) – back-of-the-envelope estimate.
Benchmarks vs back-of-the-envelope estimates
When trying to convince various people to do some back-of-the-envelope estimates, I’ve often run into arguments such as ‘Hey, don’t even try this, the only way to get anything meaningful is to try it and benchmark it properly in the context of your specific task’.
I’m not going to argue with this statement, it is indeed a good advice – that is, if trying and benchmarking is feasible. However, especially at the earlier stages in development, trying certain things can be extremely expensive; this is when back-of-the-envelope estimates come in really handy.
Back-of-the-Envelope Estimates are not a replacement for benchmarks; instead, they’re prerequisites to implementing systems which can be used for benchmarking.
Conclusions
Summarizing all the anecdotal kinda-evidence above, I am comfortable to state the following.
On the one hand, back-of-the-envelope calculations are all-important at least at architectural stages of the project. In particular, they often enable avoiding implementing things which cannot possibly fly for various reasons – and on the other hand, allow avoidance of premature optimizations which will never be important enough for the task in hand.
On the other hand, back-of-the-envelope calculations are just one of the instruments available, and, unfortunately, they’re not the most reliable one ☹.
Make sure to double-check back-of-the-envelope estimates, and to test them, and to benchmark things – as long as it is feasible, that is.
References
[Acton] Mike Acton, ‘Data-Oriented Design and C++’, CppCon 2014, https://www.youtube.com/watch?v=rX0ItVEVjHc
[Atwood] Jeff Atwood, How Good an Estimator Are You?, https://blog.codinghorror.com/how-good-an-estimator-are-you/
[Loganberry04] David ‘Loganberry’, Frithaes! - an Introduction to Colloquial Lapine!, http://bitsnbobstones.watershipdown.org/lapine/overview.html
[McConnell] Steve McConnell. Software Estimation: Demystifying the Black Art (Developer Best Practices). https://www.amazon.com/exec/obidos/ASIN/0735605351/codihorr-20
[NoBugs12] ‘No Bugs’ Hare, 640K 2256 Bytes of Memory is More than Anyone Would Ever Need Get, Overload #112, 2012
[NoBugs13] ‘No Bugs’ Hare, Hard Upper Limit on Memory Latency, Overload #116, 2013
[NoBugs16] ‘No Bugs' Hare, Infographics: Operation Costs in CPU Clock Cycles, http://ithare.com/infographics-operation-costs-in-cpu-clock-cycles/
[NoBugs16a] ‘No Bugs’ Hare, Gradual OLTP DB Development - from Zero to 10 Billion Transactions per Year and Beyond, http://ithare.com/gradual-oltp-db-development-from-zero-to-10-billion-transactions-per-year-and-beyond/
[Smith] Reliability, Maintainability and Risk. Dr David J. Smith. ISBN 978-0080969022
Acknowledgement
Cartoon by Sergey Gordeev from Gordeev Animation Graphics, Prague.
- 4S = "4-Socket"; 4U = "4 rack Units". Popular examples include HP DL580 (my personal favourite for years ;-)), and Dell R930.
- And this happens in real-world too, see also below
Notes:
More fields may be available via dynamicdata ..