Title: Professionalism in Programming #26

Author:

Date: 03 June 2004 13:16:05 +01:00 or Thu, 03 June 2004 13:16:05 +01:00

Summary:

In my previous article we learnt the important optimisation process, the steps that ensure any optimisation really is worthwhile. Being eager code monkeys, I can see that you're all desperate for practical code optimisation techniques.

So here they are - in this final part we'll investigate specific code techniques for optimisation. Just don't tell anyone that I showed you. To redress the theological balance, we'll also see how to avoid optimisation in the first place.

Body:

Technological progress has merely provided us with more efficient means for going backwards. (Aldous Huxley)

Optimisation techniques

OK, we've avoided it for long enough - now it's time to look at some of the really gory optimisation details. You've proved that your program performs badly, and have found the worst code culprit. Now you need to whip it into shape. What can you do?

There's a palette of optimisations to choose from. Which is the most appropriate will depend on: the exact cause of the problem, what you're trying to achieve (e.g. increase execution speed, or reduce code size), and how much of a performance improvement is required. We'll look at some examples of specific optimisations, but in any given situation you probably won't have much of a choice - the best modification will be obvious.

I've grouped these optimisations into two broad categories: design changes and code changes. Generally, a change at the design level will have a far more profound effect on performance than a code level tweak. An inefficient design can strangle quality more than a few dodgy lines of source code, so a design fix - whilst harder - will have a bigger payoff.

Most often, our goal is to increase execution speed. The speed-based optimisation strategies are to:

speed up slow things,
do slow things less often, or
defer slow things until you really need them.

The other common optimisation goals are to reduce memory consumption (mainly by changing the data representation, or by reducing the amount of data accessed at once), or to reduce executable size (by removing functionality, or by exploiting commonality). As we'll see, these goals often conflict - most speed increases come at the expense of memory consumption, and vice versa.

Design changes

These are the macro optimisations, the fixes on a large scale. Here we're looking at the internal design of a module. This doesn't address the highest level of performance degradation - architectural deficiencies will have the most serious impact on the performance of a system. Unfortunately these are unlikely to be fixable, since so much code depends on them. Hardware bottlenecks are probably the worst - I worked on an embedded product that was eventually abandoned because the memory subsystem was fundamentally incapable of delivering acceptable performance.

Performance-sapping bad design is also hard (or uneconomical) to fix for similar reasons. We end up papering over the cracks, employing small code level fixes instead. The nearer a project is to a release deadline, the less likely you are to perform design changes; the risk is too great^[1]. When brave enough, the kinds of design optimisation we can perform include:

Remove functionality - the quickest code is code that doesn't run at all. A function will be slow if it is doing too many things, some of which are unnecessary. Cut out the superfluous stuff. Move it elsewhere in the program. Defer all work until it's really necessary.
Exploit parallelisation. Use threading to prevent one action being serialised after another.
Avoid or remove excessive locking. It inhibits concurrency, generates overhead and often leads to deadlock. Employ static checking to prove which locks are necessary and which aren't.
Compromise design quality to gain speed. For example: reduce indirection and increasing coupling. You can do this by breaking encapsulation, leaking a class' private data through it's public interface. Knocking down module barriers will cause irreparable damage to the design. If possible, try a less disruptive optimisation mechanism first.
Change the data storage format, or its on-disk representation to something more suited to high speed operation. For example: speed up file parsing by using a binary format. Transmit or store compressed files to reduce network bandwidth.
Add layers of caching or buffering, to enhance slow data access or prevent lengthy recalculations. Precompute values that you know will be needed, and store them for immediate access.
Create a pool of resource to reduce the overhead of allocating objects. For example: preallocate memory, or hold a selection of files open rather than repeatedly opening then closing them. This technique is particularly used to speed up memory allocation; older OS memory allocation routines were designed for simple non-threaded use. Their locks stall multithreaded applications, leading to abysmal performance.
Sacrifice accuracy for speed if you can get away with it. Dropping floating point precision is the obvious example. Many devices have no FPU (Floating Point Unit hardware), and instead employ slower software FPU emulation. You can switch to fixed point arithmetic libraries to bypass a slow emulator, at the expense of numeric resolution. This is particularly easy in C++, by taking advantage of its abstract data type facilities.

Accuracy is not solely due to your choice of data types; this tactic can run far deeper, to your use of algorithms or the quality of your output. Perhaps you can let the user make this decision - allow them to select slow but accurate or fast but approximate operation modes.
Avoid overuse of exceptions. They can inhibit compiler optimisations^[2], and when used too frequently will hamper timely operation.
Forgo certain language facilities if it will save code space. Some C++ compilers allow you to disable RTTI and exceptions, with a consequent saving in executable size.

The major design level optimisations involve improvements in algorithms or data structures. Most speed degradation or memory consumption is down to a bad choice of one or both, and a subsequent change will rectify this.

Algorithms have a profound impact on the speed of execution. A function that works acceptably in a small local test may not scale up when Real World data gets thrown at it. If profiling shows that your code spends most of its time running a certain algorithm, you must make it run faster. One approach is at the code level, chipping small improvements from each instruction. A better approach is to replace the entire algorithm with a more efficient version.

Consider this realistic example. A particular algorithm runs a loop 1000 times. Each iteration takes 5 milliseconds to execute. The operation therefore completes in around 5 seconds. By tweaking the code inside the loop, you can shave 1 millisecond from each iteration - that's a saving of 1 second. But instead, you can plug in a different algorithm, where an iteration takes 7 milliseconds, although it only iterates 100 times. That's a saving of almost 4.5 seconds - significantly better.

For this reason, prefer to look at optimisations that change fundamental algorithms, not that tweak specific lines of code. There are many algorithms to chose from in the computer science world, and unless your code is particularly dire you'll always gain the most significant performance improvements by selecting a better algorithm.

Data structures are intimately related to your choice of algorithms; some algorithms require certain data structures, and vice versa. If your program is consuming far too much memory then changing the data storage format may improve matters, although often at the expense of execution speed. If you need to search a list of 1000 items quickly, don't store them in a linear array with O(n) search time; use a binary tree with O(log n) performance.

Complexity Notation

Algorithmic complexity is a measure of how well an algorithm scales - how long it takes in proportion to the size of input. It's a qualitative mathematical model, allowing you to quickly compare the performance characteristics of different implementation approaches. It doesn't measure exact execution time (this is highly dependent on CPU speed, OS configuration, etc).

Complexity is determined by the amount of work an algorithm must perform: the number of basic operations it executes. A 'basic operation' is something like: an arithmetic operation, an assignment, a test, or a data read/write. Algorithmic complexity doesn't count the exact number of operations performed, just how this value relates to the problem size. We are usually interested in the worst case performance of an algorithm, the most work that will ever need to be done. A good comparison looks at the best case and average time complexity as well.

Algorithmic complexity is expressed using Big O notation, invented by the German number theorist Edmund Landau. For a problem with input size n, it might have a complexity of:

O(1): Order 1: This is a constant time algorithm. No matter how large the input set, it always takes the same amount of time to complete the task. This is the best performance characteristic possible.
O(n): Order n: A linear time algorithm's complexity rises in line with the input size. Searching a linked list will involve visiting more nodes as the list size grows; the number of operations is directly related to the size of the list.
O(n²): Order n squared: This is where performance really begins to get bad: complexity is increasing faster than the rate of input growth. A quadratic time algorithm may seem fine when you give it a small set of data, but large data sets take a seriously long time. The bubblesort algorithm is O(n²).

Of course, complexity may be of any order; the quicksort algorithm averages O(n log n). This is worse than O(n), but far better than O(n²). A simple optimisation route for a slow bubblesort algorithm is to replace it with a quicksort algorithm, especially since there are plenty of freely available quicksort implementations.

These big O expressions don't include constants or low-order terms. You'll never see any talk about a complexity of O(2n+6). When n gets large enough, these constants and low-order terms dwarf into insignificance.

Selecting a different data structure seldom requires you to implement the new representation yourself. Most languages come with library support for all common data structures.

Code changes

And so now we creep anxiously on to the really disgusting stuff: the micro level, small scale, short sighted, code tweaking optimisations. There are many ways to molest source code for the sake of performance. You must experiment to see what works best in each situation - some changes will work well, others will have little, or even negative effect. Some may prevent the compiler's optimiser from performing it's task, producing startlingly worse results.

The first task is easy: turn on compiler optimisation, or increase the optimisation level. It often gets disabled for development builds since the optimiser can take a very long time to run^[3], increasing the build time of large projects by an order of magnitude. Try configuring the optimiser, and test what affect this has. Many compilers allow you to bias optimisation towards extra speed or reduced code size.

There are a collection of very low level optimisations that you should know about, but generally avoid. These are the kind of changes that a compiler is able to perform for you; if you've switched the optimiser on, it'll be looking in these areas. When applied by hand, they tend to butcher your code's readability the most, since they warp the fundamental logic out of shape. Only consider using one of these optimisations if you can prove it's really required, and that there are no alternatives:

Loop unrolling

For loops with very short bodies, the loop scaffolding may be more expensive than the looped operation itself. Remove this overhead by flattening it out - turn your 10 iteration loop into 10 consecutive individual statements.

Loop unrolling can be done partially; this makes more sense for large loops. You can insert four operations per iteration, and increment the loop counter by four each time. This tactic gets nasty if the loop doesn't always iterate over a whole number of unrolls.

Code inlining

For small operations, the overhead of calling a function might be prohibitive. Splitting code into functions brings significant benefits: clearer code, consistency through reuse, and the ability to isolate areas of change, yet it can be removed to increase performance. This process merges the caller(s) and the callee.

There are a number of ways to do this. With language support, you can request it in the source code (in C/C++ using the inline keyword); this method preserves a lot of the code's readability. Otherwise you have to merge the code yourself, either by duplicating the function over and over again, or using a preprocessor macro to do the work for you.

It's hard to inline recursive function calls - how would you know when to stop inlining? Try to find alternative algorithms to replace recursion.

Often inlining opens the way for more code level optimisations to be performed, that were not previously possible across a function boundary.

Constant folding

Calculations involving constant values can be computed at compile time, to reduce the amount of work done at run time. The simple expression return 6+4; can be reduced to return 10;. Carefully ordering the terms of a large calculation might bring two constants together enabling them to be reduced into a simpler subexpression.

It's unusual for a programmer to write something as obvious as return 6+4; However these sorts of expression are common after macro expansion.

Move to compile time

There is more you can do at compile time than just constant folding. Many conditional tests can be proved statically, and removed from the code. Some kinds of test can be avoided altogether; for example, remove tests for negative numbers by using unsigned data types.

Strength reduction

This is the act of replacing one operation with an equivalent that executes faster. This is most important on CPUs with poor arithmetic support. For example: replace integer multiplication/division with constant shifts or adds; x/4 can be converted to x<<2 if it's faster on your processor.

Subexpressions

Common subexpression elimination avoids the recalculation of expressions whose values have not changed. In code like this:

int first = (a * b) + 10;
int second = (a * b) / c;

The expression (a * b) is evaluated twice. Once is enough. You can factor out the common subexpression, and replace it with:

int temp = a * b;
int first = temp + 10;
int second = temp / c;

Dead code elimination

Don't write needless code; prune anything that's not strictly necessary to the program. Static analysis will show you the functions that are never used, or the sections of code that will never execute. Remove them.

Whilst the above are particularly distasteful code optimisations, the following ones are slightly more socially acceptable. They focus on increasing program execution speed.

If you find that you're repeatedly calling a slow function, don't call it so often. Cache its result and reuse this value. This might lead to less clear code, but it will run faster.
Reimplement the function in another language. For example: rewrite a critical Java function in C using the JNI (Java Native Interface) facility. Conventional compilers still beat JIT code interpreters for execution speed.

Don't naively assume that one language is 'faster' than another - many programmers have been surprised how little difference using JNI makes. It's been commonly claimed that OO languages are far slower than their procedural counterparts. This is a lie. Bad OO code can be slow, but so can bad procedural code. If you write 'OO' style code in C it's likely to be slower than good C++; the C++ compiler will generate better tuned method dispatch code than your attempts.
Reorder the code for improved performance.
Defer work until it's absolutely necessary. Don't open a file until you're about to use it. Don't calculate a value if you might not need it, wait until it's wanted. Don't call a function yet if the code will work without it.
Hoist checking further up the function. After having done a lot of work, it's a waste to perform a test that might lead to early return. Make the check sooner to avoid all the previous work.
Move invariant calculations out of a loop. The most subtle source of this problem is a loop condition. If you write: for(int n = 0; n < tree.numApples(); ++n), yet numApples() manually counts one thousand items on every call, you'll have a very slow loop. Move the count operation before the loop:
```
int numApples = tree.numApples();
for (int n = 0; n < numApples; ++n) {
... something ...
}
```
Use lookup tables for complex calculations; trading time for space. For example: rather than write a set of trigonometric functions that individually calculate their values, precalculate the return values and store them in an array. Map input values to the closest index into this array.
Exploit short circuit evaluation. Make sure that the tests likely to fail are placed first to save time. If you write a conditional expression: if (condition_one && condition_two) make sure that condition_one is statistically more likely to fail than condition_two.
Don't reinvent the wheel - reuse standard routines that have already been performance tuned. Library writers will have already honed their code carefully. But be aware that a library may have been optimised for different goals than yours; perhaps an embedded product was profiled for memory consumption, not for speed.

Size-focused code level optimisations include:

Produce compressed executables, that unpack their code before running. This doesn't necessarily affect the size of the running program, but it reduces the storage space required^[4]. This might be important if your program is stored in limited flash memory.
Factor common code into a shared function.
Move seldom used functions out of the way. Put them into a dynamically loaded library, or into a separate program.

Writing efficient code

If the best approach is to not optimise, how can we avoid any need to improve code performance? The answer is to design for performance, planning to provide adequate quality of service from the outset, rather than trying to whittle it out at the last minute.

Some will argue that this is a dangerous road to follow. Indeed, there are potential hazards for the unwary. If you try to optimise as you go along, you'll write at a lower level than needed; you'll end up with nasty, hacky code full of low-level 'performance' enhancements and special 'back door' exploits.

How do we reconcile these seemingly opposing views? It isn't hard, because they're not actually at odds. They are two complementary strategies:

write efficient code, and
optimise code later.

If you make a point of writing clear, good, efficient code now you will not need to perform heavy optimisations later. Some claim that you don't know whether any optimisation is necessary at first, so you should write everything as simply as possible, and only optimise when profiling proves that there is a bottleneck.

Pessimisations

Without careful measurement you can easily end up writing 'optimisations' that are not at all optimal. A perfectly good optimisation for one situation might turn out to be a performance disaster in another. Here's a little case study. Exhibit A: The copy on write string optimisation.

This was a common optimisation applied to C++ standard library implementations. Programs which performed intensive string manipulation experienced a massive overhead when copying long strings, both in terms of execution speed and memory consumption. Copying large strings means duplicating and shoveling around large quantities of data. Many string copies are automatically generated, temporary objects that are created and then thrown away shortly after - never actually modified. The expensive copy operation is an unnecessary cost.

The copy on write (COW) optimisation turns the string data type into a form of smart pointer; the actual string data is held in a (hidden) shared representation. The string copy operation now only has to perform an inexpensive 'smart pointer' copy (attaching a new smart pointer to the shared representation), rather than duplicate the entire string contents. Only when you make a modification to a shared string is the internal representation copied, and the smart pointer remapped. This optimisation avoids a large number of unnecessary copy operations.

COW worked well in single threaded programs; it was shown to speed up performance greatly. However, a problem became apparent when multithreaded programs used COW strings. (Indeed, this problem also manifests in single threaded programs, if the COW string class is built with multithreading support). Many implementations performed very conservative thread locking around the copy operations - these locks become a major bottleneck. Suddenly a lightning fast program slowed down to a crawl. The COW optimisation proved to be a serious pessimisation.

Far better multithreaded performance was achieved by reverting to 'classic' string implementations, and writing more careful code that reduced automatic string copying. Thankfully, library implementors now provide more intelligent versions of the string class, which are both thread safe and fast.

This approach has obvious flaws. If you know that you need a list type with good search performance (because your program must perform fast searches) pick a binary tree over an array. If you're not aware of any such requirement, then go for the most appropriate thing that will work. This still might not be the simplest - an array is a hard data structure to manage.

As you design each module, don't chase performance blindly - only spend effort on it when necessary. Understand the mandated performance requirements and at each stage justify how your choices will meet these requirements. When you know what level of performance is required, it's easier to design in appropriate efficiency. It also helps you to write explicit tests to prove that you do achieve these performance goals.

Some simple design choices that will increase efficiency and aid later optimisation are:

Minimise your reliance on functions that might be implemented on remote machines, or that will access the network or a slow data storage system.
Understand the target deployment and how the program is expected to be run, so you can design it to work well in these situations.
Write modular code, so it's easy to speed up one section without having to rewrite other sections too.

Conclusion

High performance code is not as important as some people think. Although you sometimes do have to roll your sleeves up and tinker with code, optimisation is a task you should actively avoid. To do this, make sure you know the system's performance requirements before you start working on it. At each level of design, ensure you provide this quality of service. That way, optimisation will be unnecessary.

When you optimise, be very methodical and measured in your approach. Have a clear goal, and prove that each step is getting you closer to it. Be guided by solid data, not your hunches. As you write code, ensure that your designs are efficient, but don't compromise on quality. Worry about codelevel performance only when it proves to be a problem.

That concludes this three part miniseries on optimisation. I hope you've found it useful.

^[1] Sadly, it's often only near project deadlines that anyone notices performance isn't good enough.

^[2] try/catch blocks act, like functions, as a barrier to an optimiser. It's not possible to look through the barrier to perform optimisation, so some potential speed ups will be lost.

^[3] It has to do complex inspection of the parsed code to determine the set of possible speed ups, and select the most appropriate ones.

^[4] It may have the pleasant side-effect of decreasing program start-up time; a compressed executable will load from disk much faster.

Notes:

More fields may be available via dynamicdata ..