Title: Professionalism in Programming #22

Author:

Date: Mon, 06 October 2003 13:16:00 +01:00

Summary:

Finding fault.

Body:

Nobody's perfect. Well, except for me that is. All day I have to sit down and work through tedious problems in other people's code. The test department discovers that our software falls over when they do such-and-such. So I trawl through the system to find what Programmer Fred did wrong three years ago, patch it up and send it back to test for them to break again.

Of course, you wouldn't find me making those sorts of elementary mistakes, not a chance. My code is watertight. Faultless. Low fat and cholesterol free. I don't write a line until I've gone over everything in my head, I don't complete a code statement without considering all the special cases that might occur, and I type so carefully that I've never once misplaced = for == in an if statement.

Totally fault free, me. Really.

Well, perhaps not quite.

The facts of life

I don't think anyone sits trainee programmers down and explains the facts of life to them. It's like this, son. There are the birds and the bees. Oh, and the bugs. Bugs are the inevitable dark side of constructing software, a simple fact of life. Sad, but true. Whole departments, and even industries, exist to manage them.

Everyone reading this will be only too aware of the proliferation of faults that exist in released software. How do bugs appear with such frightening regularity and in such great magnitude? It's all down to human nature. Programs are written by humans. Humans make mistakes. They make mistakes for a number of reasons (or excuses). They make mistakes because they don't understand the system they're working on well enough, because they don't correctly understand what they are implementing, but more often than not because they just don't pay enough attention to what they're doing. Most bugs are due to mindlessness. I once saw a wonderfully simple illustration of this, play along at home:

The tree that grows from an acorn is called an ....................
The noise a frog makes is a ....................
The vapour that rises from fire is called ....................
The white of an egg is called the ....................

The yolk, right? Think about it. If you didn't fall for that one, then you were probably only paying attention because I'd just warned you. Hey, give yourself a brownie point anyway. But tell me who warns you every time that you're about to write a potentially flawed line of code? They'd deserve a lifetime supply of brownie points.

So as programmers we're all to blame for the bad state of software. We're all guilty. Do we learn to live with the guilt, or do we do something about it? There are two types of response. The first school is the it's not a fault, it's a feature school. A fault turns up and we respond in the words of the great philosopher Bart Simpson: I didn't do it. Nobody saw me do it. You can't prove anything [Simpsons]. We blame compiler quirks, OS flaws, random climate changes, or computers with a mind of their own. Or as I alluded to in the opening paragraphs, we blame other people. A Teflon raincoat can be a very handy programming tool.

However, we should really subscribe to the second school, the school that concedes that software errors are not entirely inevitable. Many of these kinds of mindless mistake can be picked up or even prevented, and as responsible programmers we should be taking steps to do so. In this article we'll find out about some of this, and look at some good debugging techniques to employ when bugs do slip through the net.

Nature of the beast

Contrary to popular belief the term bug was in use before the advent of computers. In the 1870s Thomas Edison talked about bugs in electrical circuits. The story of the Harvard University Mark II Aiken Relay Calculator tells of the first recorded computer bug. In 1947, the early days of computers when they took up whole rooms, a moth flew in and managed to lodge itself in some circuits, causing a system failure. They taped it into the logbook and wrote below: First actual case of bug being found. For posterity's sake it has been preserved in the Smithsonian Institution.

Bugs are bad news. But what are they really? It's worth identifying the different varieties of bug we encounter, and understanding how they are born, survive and can be exterminated. It's also important to know what to call them; see the sidebar for more on this matter.

Nomen nudum: what shall we call them?

The term 'bug' is remarkably evocative, and incredibly imprecise. It's very easy to throw around words without really understanding what they mean. If we use more specific terminology then we'll get straight in our head some key facts.

The exact meaning of the three terms below depends on who's defining them; this can get a bit philosophical. These interpretations are largely inspired by IEEE literature [ANSI-IEEE].

Error: An error is something that we do wrong. It is a specific human action that results in software containing a fault. Whilst merrily coding away, for example, forgetting to check a condition (like the size of an array before indexing into it) is an error.
Fault: A fault is the consequence of an error, embodied in the software. I made an error, and this resulted in a fault in the code. Now at first this is a latent problem. If the code I've just written is never executed then this fault will never have a chance to cause problems. If execution often passes through the faulty code, but never in the particular way that triggers the fault, we'll never notice that there is a fault at all. This subtle little point is what makes debugging so notoriously difficult. A faulty line of code may appear to work flawlessly for years, and then one day it causes the most bizarre system tantrum you've ever seen; you'll not suspect the aged code since it's been so reliable for so long.
Failure: So a fault, if encountered, may cause a failure. It may not. The failure is what we really care about, the manifestation of the fault, and it's the only thing we'll probably take notice of^[1]. A failure is the departure of your program's operation from its requirements, from its expected behaviour. This is where we are verging on philosophy. If a tree falls over in a forest does it make a sound; if the running program doesn't exercise a bug, is the mistake still a fault? These definitions help to answer this.

If you want a hard definition of bug then it is a synonym for fault. The problem with the word "bug" is that users throw it around without knowing exactly what they're describing; this dilutes any true meaning. When being precise it's best ignored. There are other related words that can be thrown into the lexicon for good measure: defect, for example. Again, they'll mean different things if you ask different people, and we can happily survive here without getting too anal about them.

In most situations these (perhaps arbitrary) distinctions don't really matter, you can happily talk about a fault, an error, or a bug and not worry about being pedantically misinterpreted. However, in an article about bugs it's good to be clear what we're talking about.

Software bugs fall into a few broad categories, and understanding these will help us to reason about them. Some bugs are naturally harder to find than others, and this usually turns out to be related to their category. Stepping right back and squinting at them from a distance, we see these three classes emerge:

Failure to compile

It's really annoying when the code you've spent ages writing fails to compile. It means that you'll have to go and fix a tedious little typo or some parameter type mismatch, then wait for the compiler to run again before you can get to the real job of testing your handiwork. It may come as a surprise to learn that this is the best type of error you can get. Why? Simply because it's the easiest to detect and fix^[2]. It's the most immediate, and the most obvious.

Faults cost more to fix the longer it takes to detect them. We saw in the previous article that the cost of changing software rises dramatically over the life of a project, and this holds for fixing faults. The sooner we catch them and fix them, the sooner we can move on, the less fuss and cost they incur. Compilation failures are easy to notice, and usually very easy to fix. You can't run the code until you have.

Most of the time a compilation failure will be a silly syntactic mistake, or something simple like calling a function with the wrong number or type of parameters. The failure might be due to a fault in a makefile, it might be a link stage error (say, a missing function implementation), or even a build server running out of disk space.

Runtime crash

After enough donkeying about fixing your compilation errors, out pops your executable and you merrily run it. Then it crashes. You probably swear and mutter something about random cosmic rays. After the sixtieth crash you're threatening to throw your computer out of the window. These kinds of error are far harder to deal with than compilation errors, but they're still reasonable to work with.

This is because, like compilation errors, they are blindingly obvious. You can't argue with an ex-program. You can't pretend a crash is a feature. When it has kicked the bucket and shuffled off its mortal coil, you step back and begin to figure out where your program went wrong. You'll have some clues (what input sequence preceded the crash, what had happened previously), and can employ tools to discover more information (more on this later).

Unexpected behaviour

Now this is the really nasty one, when your program isn't pushing up the daisies, just pining for the fjords. Suddenly it does the wrong thing. You expected a blue square and out popped a yellow triangle. The code continues to meander on its happy way with total disregard for your frustration. What caused the yellow triangle to appear? Has the program been overthrown by a militant army of guerrilla COM objects? It will almost certainly be a minute logic problem in the bowels of the code that executed over half an hour ago. Good luck finding it...

A failure may manifest itself because of a defective single line of code, or may only show up when several interconnecting modules are finally glued together, their assumptions not quite matching up.

Moving in a bit, and looking more closely at runtime errors, a few more groupings of fault become clear. Here they are ranked in order of pain, from splinter to decapitation.

Syntactical errors

Whilst these are mostly caught by the compiler at build-time, sometimes language grammar errors slip through undetected. They can generate weird and unexpected behaviour. The syntax error will often be one of: typing = where == was meant, or & where && was meant, in a conditional expression; forgetting a semicolon or adding one in the wrong place (the classic is after a for statement); forgetting to enclose a set of loop statements in braces; or mismatching parentheses. The simplest way to avoid being tripped up by these sorts of error is to keep all warnings switched on; compilers tend to moan about a lot of these potential problems.

Build errors

Whilst not necessarily a runtime fault per se, the build error manifests itself at run time. Be on the lookout and always distrust your build system, no matter how good you think it is. In these enlightened times you're unlikely to come across a compiler bug. However, you may not always be running what you thought you built. Several times I've been hit by this: the build system failed to create a program or shared library, perhaps because makefiles didn't contain adequate dependency information, or the old executable had a bad timestamp. Every time I tested a modification I was still running the old buggy code unawares. There are a number of ways to confuse a build system, but the worst part is you don't notice it failing - like a leprous limb.

It can take quite some time (and maybe even a brief stint in the funny farm) to notice that this is biting you. For this reason, when you feel at all wary of what's going on it can be sensible to do a total clean out of your project, and then rebuild from scratch. This should flush out any possible build system problems^[3].

Basic semantic bugs

The majority of runtime faults are due to very simple errors causing incorrect behaviour. Using uninitialised variables is a classic example, and can be quite hard to track since the program's behaviour may depend on the garbage value that was previously in the memory location used by the variable. One time the program will work fine, another time it may fail. Other basic semantic faults are: comparing floats for equality, writing calculations that don't handle numerical overflow, and rounding errors from implicit type conversions (losing the sign of a char is common). This type of semantic fault is often caught with static analysis tools.

Semantic bugs

These are much harder to identify, the insidious errors that won't be caught by inspection tools. A semantic bug might be a low-level error like the wrong variable being used in the wrong place, not validating a function's input parameters, or getting a loop wrong. It may be a higher-level piece of wrong-headedness, calling an API incorrectly, or not keeping an object's state internally consistent. A pile of memory related errors fall in this category - they can be evil to find due to their ability to warp and corrupt your running code, so that it behaves in totally unpredictable and unreasonable ways. Programs often behave weirdly. The only consolation is that they're doing exactly what we told them to.

The best kind of runtime failures are the reliable ones. If they're reproducible, they are much easier to write tests for, and track down the cause of. The failures that don't always occur tend to be memory corruptions.

Now that we have things in neat little boxes, let's zoom right in and take a look at some of the specific types of runtime failure. These are some common semantic faults that we come across.

Segmentation faults (or protection faults): come from accessing memory locations that have not been allocated for the program's use. They result in the operating system aborting the application code and producing some form of error message, usually with diagnostic information. This can be triggered by dodgy pointer arithmetic, or far too easily by typing errors involving pointers. A common C typo causing a segfault is scanf("%d", number); The missing & before number makes scanf try to write into the memory location referenced by the (garbage) contents of number, and poof! the program disappears in blue smoke. If you're really unlucky, though, number happens to hold a value that equates to a valid memory address. Now your code will continue as if nothing was wrong, until the memory you just wrote over is used and your fate is in the lap of the gods.
Memory overruns: are caused by writing past memory that has been allocated for your data structure, be it an array, a vector, or some other custom construct. When writing values into the wide blue yonder, you'll generally end up clobbering data from some other part of your program. If you're running on an unprotected operating system (more common in embedded environments) you may even tamper with data from another process or the OS itself. Ouch. Memory overrun is a common problem and difficult to detect, usually the symptom is random unexpected behaviour manifesting at a much later point than the overrun, many thousands of instructions later. If you're lucky the memory overrun hits an invalid memory address and you get a segfault which is hard not to notice. Use 'safe' data structures wherever possible to insulate yourself from the possibility of such disaster.
Memory leaks: are a constant threat in non-garbage collected languages^[4]. When you want some memory you have to ask the runtime for it nicely (using malloc in C or new in C++), and then you have to be polite and give it back when you're done (using free and delete respectively). If you rudely forget to release memory, your program slowly consumes more and more of the computer's scarce resources. You may not notice it at first, but gradually your computer's response will degrade, as memory pages thrash to and from the disk. Two other classes of error relate to this: freeing a memory block too many times causing unpredictable environmental failures, and not managing other scarce resources carefully, like file handles and network connections.
Running out of memory: is always a possibility, as is running out of file handles or any other managed resource. It might be rare (modern computers have so much memory, how could this possibly happen?) but that's no excuse to ignore the potential for failure. Only sloppy code fails to make appropriate checks and will consequently perform in a very brittle manner when run in constrained situations. Always validate the return status of a memory allocation or file open system call. It is worth noting that some modern operating systems^[5] will never return a failure from a memory allocation call - every allocation returns a pointer to a reserved but unallocated memory page. When the program eventually tries to access this page, an OS mechanism traps the access and then really allocates memory to the page, resuming normal program operation. This all works nicely until the available memory finally is exhausted. Your program will then be sent error signals, a long time after the relevant allocation occurred.
Maths errors: (or "Math" errors for those using strange variants of the English language - Ed) come in a number of guises: floating point exceptions, incorrect mathematical constructions or incorrect use of floating point numbers (for example, divide by zero). Even trying to output a float but passing an int through printf("%f") can cause your program to bomb with a maths error.
Program hangs: are usually caused by bad program logic. Infinite loops with badly crafted terminal cases are the most common, we also see deadlock or race conditions in threaded code, and in event-driven code the waiting on events that will never occur. It is usually fairly easy to interrupt the running program, see where the code has stalled and determine the cause of the hang.

Different OSes, languages, and environments report these errors in different ways, with different wording. Some languages try to avoid types of error by not providing features you can shoot yourself in the foot with. Java, for example, has no pointers and checks every memory access you make automatically.

Pest extermination

Like a hypochondriac, our code is constantly complaining about being ill. More often than not it genuinely is in need of some attention. We're the doctors. If our code is sick then we've got to perform the diagnosis, the surgery, and nurture it through its convalescence.

Weeding out bugs is hard. Not only do humans make mistakes when writing, they also make mistakes when reading. When I proof read these articles I have a tendency to read what I meant to write and not what I really wrote; it works the same for software. When we look at our faulty code we'll tend to see what we intended, not how the compiler actually interprets our instructions. In this respect the compiler is really quite pedantic, it can only produce exactly what we asked, not what we were hoping for.

Some programmers introduce far fewer faults into their code than their peers (as much as 60% less), can find and fix faults quicker (in as little as 35% of the time), and introduce fewer faults as they do so (figures from [Gould]). How do they do it? They are naturally able to pay more attention to the task, and can focus on the microscopic level of the code they're writing whilst keeping the broader picture in mind.

The professional programmer is always mindful of introducing faults, and will try to fix a detected problem sooner rather than later. Certainly, it's wrong to presume that we only check for problems when the software has been written. I've known many programmers who believe that the test department exists to detect their bugs for them. This is just plain wrong.

There is a clear difference between testing and debugging. Testing identifies the presence of a fault, e.g. the program output is incorrect, whereas debugging is the process of reproducing, locating, understanding, and fixing a fault.

Testing is QA, that is quality assurance; debugging is repairing a problem. You don't get quality by fixing bugs, you can't add it in at the end of software development, you must plan the quality into the architecture and implementation. Testing won't prove the absence of faults, it won't catch all errors. It's impossible to draft exhaustive test cases; software is just too complex. We will inevitably release software into the field containing faults that may still crop up. Yes, the quality of our software is in part down to the quality of our testing department, but also to our personal testing, and the quality of the fixes that we implement.

Debugging techniques and tools

There is an art to debugging, and it's very much something to be learnt. It's a skill. Experience shows you how to become an effective debugger. And this is something that we will all get plenty of experience at. Now, different people's brains work in very different ways, and they have different ways of problem solving. What works for one programmer may not for another. However, there are some general principles that always apply.

The sidebar below offers a whistle-stop tour of the tools available to aid our bug hunting. How we use these tools and where and when they are applicable will differ from situation to situation. However, one of the most potent weapons in our debugging arsenal is a distrust of anyone's code mixed with a healthy dose of cynicism. The cause of your errant behaviour could be absolutely anything, and in the act of diagnosis we should start by eliminating even the most unlikely of candidates.

How difficult it is to find a fault depends on how well you know the code it's lurking in. It's hard to jump into some random source and make any kind of judgement about it without knowing the structure and how it's intended to work. For this reason, if you have to debug some new code take time to learn it first, it really will pay off in the long run.

The ease of debugging is also dependent on the control you have over the execution environment, how much you can play around with the running program and inspect its state. In an embedded environment debugging can be much harder because the tool support is sparser. You're also probably running in an environment that is providing a lot less insulation from your own stupidity; little mistakes can have much bigger consequences.

There are two distinct facets to debugging: finding the fault and fixing the fault. The following sections describe a sensible approach to both.

The golden rule when debugging is this: Use Your Brain. Think. Consider what you're doing. Don't flail around thoughtlessly hacking at bits of code until something begins to look like it might be working. Now, sometimes a quick fiddle about will get you results, sometimes some hacky little exploratory tests will pinpoint the problem quickly. So is it a justifiable thing to do? Perhaps, but if you make the conscious decision to do some quick-and-dirty stabbing around, set yourself a hard time limit to do it in. It's all too easy to spend an entire morning with the 'just one more little go' approach. After the time limit is up, follow the more methodical approach laid out below.

If your quick stab turns up trumps and you do find the fault, reengage your thinking gear. Look at the How to fix faults section below, make the change carefully and thoughtfully. Just because the fault was easy to find, it doesn't necessarily mean that the fix is quite as obvious as it looks.

Wasp spray, slug repellant, fly paper...

Debugging would be a lot nicer if there was someone else to do the job for us. Whilst that'll never happen, we can make the job a lot more palatable with a little help. Many useful tools exist; you'd be stupid not to take advantage of them. A little time learning how they work may reduce your debugging time immeasurably.

Some tools are interactive, allowing you to inspect the code in various ways whilst a program is actually running. In advanced development environments these tools may be seamlessly integrated, or they may need to be run as separate programs. Other tools are non-interactive, often running as a code filter or parser spitting out information about the code following analysis. In this list we'll also consider tools you may not have thought of as debugging aids, and even some helpful procedures.

Debugger: This is perhaps the most well known debugging tool, its name kind of gives its purpose away. A debugger is an interactive tool that allows you to view the internals of your running program and poke around with it. You can follow the flow of control, inspect the contents of variables, set breakpoints in the code for later interruption, even run arbitrary sections of code at will. Debuggers come in many shapes and sizes, some command line tools, some graphical applications. Usually there will be at least one available for your particular development platform (although the ubiquitous gdb seems to get ported to every conceivable platform these days). A debugger relies on symbols being left in your executable (these are the compiler's debugging information which are normally stripped out at the final link stage) - it uses these to provide you with information about function and variable names, and the location of the source files. A debugger is a rich and powerful tool, however I believe that they can often be misused or overused, and can actually inhibit good debugging. Programmers easily get wrapped up chasing what the program is doing, getting side tracked by observing the wrong variable values, stepping into the wrong functions, and don't sit back and think about the problem they are trying to solve. A little more thought about a failure may pinpoint the specific fault far quicker than trying to hunt it down in a debugger.
Memory access validator: This interactive tool inspects your running program for memory leaks and overruns. It can be remarkably useful, showing up reams of memory release failures you never knew existed.
System call trace utilities: such as Linux's strace, show all the system calls issued by an application. This can be a good way to see how a program is interacting with its environment, particularly useful when it appears to be stalled on some external activity that is not happening.
Core dump: This is a Unix term for the OS-generated snapshot of a program that can be produced when it exits abnormally. The term derives from archaic machines with ferrite core memory, however the dump file is still called core. It contains a copy of the program's memory when it died, the state of the CPU registers, and the function call stack. The core dump can be loaded into an analyser (which is most often the debugger) to query a number of useful bits of information.
Logging facilities: allow you to programmatically generate information about your application as it runs. Rich logging systems allow you to assign priorities to the output (e.g. debug, warning, fatal), and then filter out a particular message level at run time. The program's log gives a history of activity that can help pinpoint what circumstances triggered a failure. The logging facility may be an integral part of the operating environment, or provided by a third party library. Without such support you'll see the use of printf/cerr diagnostic information, introduced on a very ad hoc basis. This is about as basic as you can get, and must be carefully removed in the production code release. printfs may also clobber the normal program output. I have worked in environments where even lowly printfs weren't available; when bringing up a system board the only diagnostic output I had was a single eight segment LED display, and a scope attached to a spare system bus! There are downsides to logging: it can slow down program execution and bloat the executable size if the logging statements can't be compiled out completely. Some logging systems are useless for trapping a program crash, since at the crash time messages may still be stuck in an output buffer that will never get flushed. Be sure you know how well your logging mechanism works, and always send diagnostic printfs to the unbuffered stderr, not stdout.
Static analyser: This is a type of non-interactive tool that inspects source code for potential problem areas. Many compilers include support for this kind of functionality when set to their maximum warning level, but good static analysis tools go far beyond this. Products exist to discover problem code, any usage of undefined behaviour or non-portable constructs, to identify dangerous programming practices, to provide code metrics, to enforce coding standards, and to create test harnesses. Use of a static analysis tool can eradicate many errors before they have a chance to bite. A handy safety net. It's a sound pragmatic idea to use a static analyser from a different company than your compiler manufacturer - they're less likely to have made the same set of mistakes.
Code reviews: often identify problem areas that would otherwise go undetected. They were described in an earlier article [Goodliffe4]. If you've never done one, you'll be surprised how many faults can get unearthed this way.
Defensive programming techniques [Goodliffe9]: greatly reduce the likelihood of all sorts of errors. In particular, the use of assertions to check logical invariant conditions can be crucial. Whilst tracking a bug you can insert more assertions to validate the assumptions you've made about the code.
Fault logging/reporting database systems: such as Bugzilla provide persistent records of all failures so no problem, no matter how small, is ever forgotten. It helps you gather statistics on the quality of the project, so you know when it has reached a releasable state. It is a key tool, integral to the development process. It won't find faults for you, but helps co-ordinate the process of doing so. It allows you to assign problems to engineers, to mark issues as resolved or duplicated, and acts as a bridge between the test department and development. No software development organisation should function without such a system in place, although it's frightening that many do.
Source code editor: A good editor will prevent you from making a whole pile of silly mistakes. Syntax highlighting often provides visual cues when you've made an error. You'll see when you mismatch comment delimiters, or get brace or parenthesis mismatches. A good editing environment also provides navigation around your code so you can find offending areas easily.
A version management system: stores the source code and a history of its development. It allows you to review changes that have been made, find out who made them and when. When a fault rears its head you can revert to a previously working revision and inspect the differences that have been made.

Bug hunting

So how do we find bugs? If there was a simple three-step process we'd all have learnt it and our programs would be perfect by now. As it is, there isn't and they aren't. Let's try to distil the available bug hunting wisdom.

Compile time errors. We'll look at these first, since they are comparatively easy to deal with. When your compiler comes across something unpleasant it will not normally just complain the once, but take the opportunity to sound off about life in general, spitting out a ream of other subsequent error messages. It's been told to do this; upon encountering any error the compiler tries to pick itself back up and carry on parsing away. It's not always too good at it, but with code like yours who could blame it?

The upshot is that the later compiler messages can all be quite random and irrelevant. You should only need to look at the very first error reported, and sort out that problem. Have a glance further down the list by all means; there may be some other useful messages down there, but more often than not there aren't.

Even this first compiler error may be cryptic or misleading, depending on the quality of the compiler (if you're really stumped by what an error means try another compiler, perhaps). Hardcore C++ template code can produce inspired errors from some compilers. The reported fault usually is on the line that the compiler reports, but sometimes it may actually be on the preceding line - a syntax error there causes the following line to be nonsensical, and this is what the compiler notices and moans about.

Linker errors, on the whole, are far less cryptic. The linker will tell you that it's missing a function or library and so you'd better go off and find it (or write it). Sometimes the linker may complain about arcane vtable-related C++ problems; this is usually a symptom of a missing destructor implementation or something like that.
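To make the vtable complaint concrete, here is a small sketch. The Shape and Circle classes are invented for illustration. With GCC or Clang, leaving out the one-line destructor definition typically earns you an 'undefined reference' to Shape::~Shape(), and often to the vtable for Shape as well, although the exact wording depends on your toolchain.

```cpp
#include <cassert>
#include <string>

// Hypothetical classes, invented purely to illustrate the linker complaint.
class Shape {
public:
    virtual ~Shape();                       // declared here...
    virtual std::string name() const = 0;
};

// ...and implemented here. Comment this line out and the code still compiles,
// but the linker has nothing to resolve the destructor against.
Shape::~Shape() {}

class Circle : public Shape {
public:
    std::string name() const override { return "circle"; }
};

std::string describe(const Shape& s) { return "a " + s.name(); }

// A tiny usage check; it only links because Shape::~Shape() is defined above.
void check_shapes()
{
    Circle c;
    assert(describe(c) == "a circle");
}
```

The point to take away is that the compiler is perfectly happy with a declaration alone; it is only the linker that notices the implementation never turned up.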

Run time errors require a little more of a game plan. If your program contains a bug then it's likely that somewhere in the code a condition you believed to be true isn't. Finding the bug is a process of confirming what you think is correct until you find the place where the condition doesn't hold. You have to develop a model of how the code really works and compare this with how you'd intended it to. The only sensible way to do this is methodically.
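One cheap way of confirming what you think is correct is to write each belief down as an assertion, so the program tells you the moment it stops being true. The following is only a sketch; find_sorted is a made-up function, and the two beliefs it checks are examples rather than a complete set.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical lookup: returns the index of key in sorted_values, or -1 if absent.
int find_sorted(const std::vector<int>& sorted_values, int key)
{
    // Belief 1: the caller really did hand us sorted data.
    assert(std::is_sorted(sorted_values.begin(), sorted_values.end()));

    auto it = std::lower_bound(sorted_values.begin(), sorted_values.end(), key);
    if (it == sorted_values.end() || *it != key)
        return -1;

    int index = static_cast<int>(it - sorted_values.begin());

    // Belief 2: the index we are about to return points at the key.
    assert(sorted_values[index] == key);
    return index;
}
```

If the first assertion fires then the fault is in the calling code, not here, which is exactly the sort of thing you want to know before you start changing anything.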

Scientific method is the process scientists use to develop an accurate representation of the world. That sounds akin to what we are trying to do. There are four steps to scientific method: (i) observe a phenomenon, (ii) form a hypothesis to explain it, (iii) use this hypothesis to predict the results of further observations, and finally (iv) perform experiments to test these predictions. Now I'm not proposing that we use scientific method wholesale, for a start we're trying to get rid of the errant phenomenon rather than build a model of it. However, scientific method is a good backbone and you'll see it reflected in the steps below.

Identify a failure. It all starts here, when you notice that the program doesn't do what it's supposed to. It may crash, it may just produce a yellow triangle, but you know something's up and you've got to fix it. The first thing you do is put a fault report into the fault database. This is particularly valuable if you're in the middle of tracking some other bug or have no time to handle the fault right now. Making a record ensures the fault doesn't get lost. Don't just make a mental note to come back to a problem later. You'll forget.

Even if you're going to start fixing the fault immediately, having the record in the database serves a useful purpose - it shows other developers that a problem has been identified and is under investigation. It also allows reports to be generated about the number of issues remaining/resolved in the codebase.

Identify the nature of the errant behaviour. Characterise the problem as completely as possible by answering questions like: is it timing sensitive? Does it depend on input, system load, or program state? If you don't understand the bug before you try to fix it you'll just be changing code until the symptom disappears. You may only have masked the cause, so the fault will crop up elsewhere.

Reproduce it. This goes alongside characterising the failure. Work out the set of steps you can take to reliably trigger the problem. If there is more than one way then document them all.

You have a problem if the bug isn't reproducible; the best you can do is set mousetraps for the fault and see what you can find out when it does occur. For these unreliable failures, keep careful notes of the information you collect, it may be a while until you next see the problem crop up.
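What might a mousetrap look like in code? Here is one possible sketch, assuming you can afford to keep a short in-memory history of recent events. The Mousetrap class and its note and check functions are invented for this example; a real system would more likely hook into whatever logging it already has.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <vector>

// Hypothetical "mousetrap": remembers the last few events so that, when an
// intermittent fault finally trips a check, there is some evidence to examine.
class Mousetrap {
public:
    explicit Mousetrap(std::size_t capacity) : capacity_(capacity)
    {
        assert(capacity_ > 0);
    }

    // Record what the program was doing; the oldest entries are discarded.
    void note(const std::string& event)
    {
        if (recent_.size() == capacity_)
            recent_.pop_front();
        recent_.push_back(event);
    }

    // Call this where the fault is suspected. The first time the condition
    // fails, the trap springs and keeps a snapshot of the events leading up to it.
    bool check(bool condition, const std::string& what)
    {
        if (!condition && !sprung_) {
            sprung_ = true;
            evidence_.assign(recent_.begin(), recent_.end());
            evidence_.push_back("FAILED: " + what);
        }
        return condition;
    }

    bool sprung() const { return sprung_; }
    const std::vector<std::string>& evidence() const { return evidence_; }

private:
    std::size_t capacity_;
    std::deque<std::string> recent_;
    bool sprung_ = false;
    std::vector<std::string> evidence_;
};
```

Only the first failure is captured, on the grounds that whatever happens after the fault has struck is usually less trustworthy than what happened just before it.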

Locate the fault. This is the big one. You've got the scent, now you need to track the beast and pinpoint its location from what you've learnt. That's far more easily said than done. This is a process of eliminating all the things that don't contribute to the failure, or are working correctly, Sherlock Holmes-style. You may need to draft new tests. You may need to poke around in the seedy underbelly of the system. You will probably find that there is more information you need to gather as you progress.

Analyse what you have found about the failure. Without jumping to conclusions, draw up a list of code suspects. See if you can spot patterns of events that hint at causes. If possible, keep a record of the inputs and outputs that demonstrate the problem.

A good starting point for the investigation is where the error manifests itself - although this is rarely the actual habitat of the fault. Remember, just because a failure exhibits itself in one module that doesn't necessarily mean that that module is to blame. Determining this position is easy if your program crashed: you can use a debugger to get information like the line of code in question, the values of all variables at that point, and what called this function. In the absence of a crash, start from a point you know exhibits incorrect behaviour. Work backwards from there following the flow of control, checking that the code is doing what you expect at each point.

There are a few common bug hunting strategies. The worst is randomly changing things to see if the failure goes away. This is an immature approach. (A professional will at least try to make it look scientific!) A far better strategy is divide and conquer. Say you have the fault pinned down to a single function that consists of ten steps. After the fifth print out the intermediate result, or set a breakpoint and investigate it in your debugger. If the value is good then the fault lies in the instructions after this, otherwise it's in the instructions before. Concentrate on those instructions and repeat until you've cornered the fault.
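The divide and conquer hunt can even be mechanised, at least in a toy setting. The sketch below assumes the function under suspicion can be modelled as a list of steps that each turn one int into another, that you have worked out by hand what the value should be after every step, and that once the value goes wrong it stays wrong. Step, run_steps and first_bad_step are all names made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical model: each step of the suspect function maps one int to another.
using Step = std::function<int(int)>;

// Runs only the first `count` steps so the intermediate value can be inspected.
int run_steps(const std::vector<Step>& steps, std::size_t count, int input)
{
    assert(count <= steps.size());
    int value = input;
    for (std::size_t i = 0; i < count; ++i)
        value = steps[i](value);
    return value;
}

// Returns the index of the first step that produces a wrong value.
// expected_after[i] is the hand-calculated value after steps 0..i have run.
std::size_t first_bad_step(const std::vector<Step>& steps, int input,
                           const std::vector<int>& expected_after)
{
    assert(!steps.empty() && expected_after.size() == steps.size());
    // The hunt only makes sense if the final result really is wrong.
    assert(run_steps(steps, steps.size(), input) != expected_after.back());

    std::size_t good = 0;            // after 0 steps the value is just the input
    std::size_t bad = steps.size();  // after all steps the value is known to be wrong
    while (bad - good > 1) {
        std::size_t mid = good + (bad - good) / 2;
        if (run_steps(steps, mid, input) == expected_after[mid - 1])
            good = mid;  // still correct after `mid` steps: look later
        else
            bad = mid;   // already wrong after `mid` steps: look earlier
    }
    return bad - 1;
}
```

In real life the "expected" values live in your head or on a scrap of paper and the checking is done with print statements or a debugger, but the halving logic is the same.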

Another technique is the dry run method. Rather than relying on intuition to locate the error, you play the role of the computer, tracing program execution through a trial run, calculating all intermediate values, to get the final result. If your result and reality don't match then you know a fault lies in the code. Although time consuming this can be very effective, highlighting your bad assumptions.

Understand the real problem once you've found where it's lurking. If it's a simple syntactical error then getting your head round it isn't too bad. For more complex semantic problems make sure you really know what the problem is, and all the ways that it may manifest itself before you move on.

Create a test.

Write a test case for the failure that exercises it. You may have done this in the 'reproduce it' step if you were clever. If you didn't, then you really want to write one now. With your new understanding make sure the test is rigorous.

Fix the fault. See the following section for a discussion of this part.

Prove you've fixed it. Now you know why you wrote a test case. Run it, and prove the world is a better place. The test case can be added to your regression test suite to ensure that the fault is never reintroduced at a later point.
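As a sketch of what such a test case might look like, suppose the fault report said that 1900 was being treated as a leap year. Both the fault and the is_leap_year function are invented for this example, and plain assert stands in for whatever test framework you actually use.

```cpp
#include <cassert>

// The fixed function. The (invented) fault: the original code only tested
// year % 4 == 0, so century years such as 1900 were wrongly reported as leap years.
bool is_leap_year(int year)
{
    return (year % 4 == 0 && year % 100 != 0) || year % 400 == 0;
}

// Regression test written from the fault report. It failed before the fix,
// passes after it, and stays in the suite to catch the fault if it ever returns.
void test_century_years()
{
    assert(!is_leap_year(1900));  // the reported failure
    assert(is_leap_year(2000));   // the exception to the century rule
    assert(is_leap_year(2004));   // ordinary cases must still behave
    assert(!is_leap_year(2003));
}
```

Notice that the test covers more than the single reported value; with your new understanding of the problem you know which neighbouring cases are worth checking too.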

Sometimes you try all this but it just doesn't work, you're left wailing and gnashing your teeth, with a sore head from banging it against a brick wall for too long. When things get this bad I always find it helps to explain the whole problem to someone else. Somewhere in the description everything seems to slip into place and I see the one key piece of information I had been missing all along. Try it and see. Perhaps this is why pair programming is such a successful strategy.

How to fix faults

You'll notice that this section is much smaller than the preceding one. Funny that. Usually the whole problem is finding the darned fault. Once you've worked out where it is, then the fix is obvious.

But don't let that lure you into a false sense of security. Don't stop thinking once you've diagnosed the source of your errant behaviour. It's very important not to break anything else as you make the fix - it's surprisingly easy to trample over something in the flower bed as you stroll over to pluck out a weed.

As you modify code always ask yourself 'what are the consequences of this change?' Be aware of whether the fix is isolated to a single statement, or affects other surrounding bits of code. Might the effect of your change ripple out to any code that calls this function, does it subtly alter the behaviour of the function?

Convince yourself that you have really found the cause of the problem, and not just another symptom. Then you can feel confident you've put a fix in the right place. Consider whether similar mistakes may have been made elsewhere in related modules, and go and fix them if necessary^[6].

Finally, try to learn from your mistake. We must learn or otherwise be doomed to repeat the same errors for all eternity. Is it a simple programming error you keep making, or something more fundamental, the incorrect application of an algorithm?

Prevention

Anyone will tell you that prevention is better than a cure. The best way to manage bugs is to not introduce them. Sadly I don't think we'll ever completely reach this ideal, but careful programming can avoid so many problems. Good programming is about discipline and attention to detail.

This section could be enormous, but all prevention advice boils down to one simple statement: Use Your Brain. Enough said.

Conclusion

Like death and taxes, no matter how hard we try to avoid them, bugs happen. Sure, we should use every sort of anti-wrinkle cream available and manipulate our money in cunning ways to mitigate the effects. But if we don't know how to deal with faults when they stare us in the face then our code is doomed.

Debugging is a skill you develop. It doesn't rely on guesswork, but on methodical detection and thoughtful repair.

References

[Simpsons] The Simpsons. Do the Bart Man. 1991, Geffen. GEF87CD.

ANSI/IEEE. IEEE Standard Glossary of Software Engineering Terminology. 1984, ANSI/IEEE Standard 729.

[Gould] John Gould. "Some Psychological Evidence on How People Debug Computer Programs." 1975, International Journal of Man-Machine Studies. No 7.

[Goodliffe4] Pete Goodliffe. "Professionalism in programming #4: Code reviews." C Vu, Volume 12, No 5. ISSN: 1354-3164.

[Goodliffe9] Pete Goodliffe. "Professionalism in programming #9: Defensive programming." C Vu, Volume 13, No 3. ISSN: 1354-3164.

^[1] This isn't necessarily the way it should be. Code inspections, when done, should pick up on a lot of faults that have never had a chance to manifest themselves as failures.

^[2] Provided you have a sane build environment that stops when it encounters an error and provides some reasonable diagnostic messages.

^[3] This presumes that you trust your 'build clean' facility. To be really thorough you can delete the project and check it back out again afresh. Alternatively, manually remove all intermediate object files, libraries and executables. For large projects both of these options can be tedious in the extreme. C'est la vie.

^[4] OK, it is possible to leak memory in a garbage collected language. Hand two objects references to one another and then let go of both of them. A simple reference-counting collector will never sweep them up; you need a collector that can detect cycles.

^[5] This is certainly the case for Linux, at least until you exhaust the virtual memory address space. At this point malloc may return 0, but the system would probably have keeled over before you got a chance to notice. I'm not sure how Windows works in this respect.

^[6] This is a good reason why "cut and paste" programming is bad - it is far too dangerous. You may end up mindlessly duplicating bugs, which then can't be fixed in one single place.
