Title: Professionalism in Programming #23

Author:

Date: Fri, 05 December 2003 13:16:02 +00:00

Summary:

To err is human.

Body:

We know that the only way to avoid error is to detect it, that the only way to detect it is to be free to inquire. (J. Robert Oppenheimer)

At some point in their life everyone has this epiphany: the world doesn't always work as you expect. My one-year-old friend Tom learnt this when climbing a chair four times his size. The equal and opposite reaction came as quite a shock; he ended up under a pile of furniture.

Is the world broken? Is it wrong? No. The world has happily plodded along in its own way for the last few million years, and looks set to continue for the foreseeable future. It's our expectations that are wrong and need adjustment. As they say: bad things happen, so deal with it. We must write code that deals with the Real World and its unexpected ways.

This is particularly difficult because the world mostly works as we'd expect, constantly lulling us into a false sense of security. The human brain is wired to cope, with inbuilt fail-safes. If someone bricked up your front door you'd stop automatically, before walking into an unexpected wall. Programs are not so clever; we have to tell them what to do when it all goes wrong.

From whence it came

To expect the unexpected shows a thoroughly modern intellect. (Oscar Wilde)

Errors can and will occur. In a large program, an error may occur for one of a thousand reasons. But it will fall into one of these three categories:

User error: The stupid user manhandled your lovely program. Perhaps they provided the wrong input, or attempted an operation that's patently absurd. A good program will point out the mistake, and help the user rectify it. It won't insult them, or whinge in an incomprehensible manner.
Programmer error: The user pushed all the right buttons, but the code's broken. This is a bug, a fault the programmer introduced that the user can do nothing about (except try to avoid it in the future).
Exceptional circumstances: The user pushed all the right buttons, and the programmer didn't mess up. Fate's fickle finger intervened, and we ran into something that couldn't be avoided. Perhaps a network connection failed, we ran out of printer ink, or there's no hard disk space left. These are the most common forms of errors and, unfortunately, the hardest to deal with.

Each of these error categories has a different audience. Users don't want to be bothered by programmer errors - there's nothing they can do about them anyway (short of never using that piece of tatty software again).

In our code we need a well-defined strategy to manage each kind of error. An error may be detected and reported to the user in a pop-up message box, or it may be detected by a middle-tier code layer, and signaled to the client code programmatically. At both levels the same principles apply. It may be a human choosing how to handle the problem, or lower down the food chain it's your code making a decision - someone is responsible for acknowledging and acting on errors. Errors are generally raised by subordinate components and communicated upwards, to be dealt with by the caller.

Errors are reported in a number of ways; we'll look at these in the next section. To take control of program execution we need to be able to:

raise an error when something goes wrong,
detect all possible error reports,
handle them appropriately, and
propagate errors we can't handle.

Each of these tasks is addressed in the subsequent sections.

Errors are hard to deal with. The error you encounter is often not related to what you were doing at the time (they are mostly 'exceptional circumstances'). They are also tedious to deal with - we want to focus on what our program should be doing, not on how it may go wrong. However, without good error management your program will be brittle - built upon sand not rock; at the first sign of wind or rain it will collapse.

Take error handling seriously. The stability of your code rests upon it.

Error reporting mechanisms

There are several common strategies for propagating error information to client code. You'll run into code that uses each of them, so you must know how to speak every dialect. Some reporting mechanisms are particularly suited to certain languages or operating environments, as we'll see. This list contains nothing unexpected, but we should take time to understand how the techniques compare, and which is most effective for any given situation.

None

The simplest error reporting mechanism is: don't bother. This works wonderfully in the cases where you want your program to behave in bizarre unpredictable ways, and crash randomly.

If you encounter an error and don't know what to do about it, blindly ignoring it is not a viable option. You probably can't continue your work, and returning without fulfilling your function's 'contract' will leave the world in an undefined and inconsistent state.

Never ignore an error condition.

If you don't know how to handle the problem, then signal a failure back up to the calling code. Don't sweep an error under the carpet and hope for the best.

An alternative to ignoring errors is to instantly abort the program upon encountering a problem. It's easier than handling errors throughout the code, but hardly a well-engineered solution! [Though for some situations it can be the best solution - JAD]

Return values

The next simplest approach is to return a success/failure value from the function. What that value gets set to depends on: the number of ways a function may fail, the diligence of the programmer, and any project conventions. Most functions will return a boolean value, a simple yes or no answer. More advanced mechanisms enumerate the possible ways a function can exit and return a status value to signify this, known as a reason code. One of those reason code values will mean 'success', the rest represent the many and varied abortive cases. This enumeration may be shared across the whole codebase, in which case your function returns a subset of the available values. You should therefore document what the caller can expect.
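As a rough illustration of a reason code, here is a minimal sketch in C++. The PortStatus enumeration and the checkPort function are invented for this example; they are not part of any real library.

```cpp
#include <cassert>
#include <string>

// Hypothetical reason codes for one small module. `ok` is the single
// 'success' value; every other value names a distinct way to fail.
enum class PortStatus { ok, empty_input, not_a_number, out_of_range };

// Validates a TCP port number supplied as text. There is no computed
// value to hand back, so the return value is free to carry the reason code.
PortStatus checkPort(const std::string& text) {
    if (text.empty()) return PortStatus::empty_input;
    long value = 0;
    for (char ch : text) {
        if (ch < '0' || ch > '9') return PortStatus::not_a_number;
        value = value * 10 + (ch - '0');
        // Bailing out here also keeps `value` from overflowing.
        if (value > 65535) return PortStatus::out_of_range;
    }
    return value == 0 ? PortStatus::out_of_range : PortStatus::ok;
}
```

A caller can switch on the result and act differently for each failure, which a plain boolean would not allow.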

Whilst this approach works well for procedures that don't return other computed values, passing error codes back with returned data gets messy. For example, if you have a function int count() that walks down a linked list and returns the number of elements, how can it signify a list structure corruption alongside that int return value? There are three approaches:

Return a compound data type (or tuple) containing both the return value and an error code. This is easier in some languages than others. Whilst reasonably elegant, this technique is seldom used; it is rather clumsy in the popular C-like languages.
You could pass the error code (and/or the original return value) back through a function parameter. In C++ this parameter would be passed by reference. In C you'd indirect the variable access with pointers. This approach is ugly and non-intuitive; there is no syntactic way of distinguishing a return value from an error code.
The alternative is to reserve a range of the return values to signify failure. The count example above could nominate all negative numbers as error reason codes. They'd be meaningless answers, anyway. Negative integer values are a common choice for this. Pointer return values may be given a specific 'invalid' value, which by convention is zero (or NULL). In Java you could return a null object reference.

This technique doesn't always work well. Sometimes it's hard to reserve an error range - all return values are equally meaningful and equally likely. It also has the side effect of reducing the available range of 'good' values; the use of negative values reduces the possible positive values by a binary order of magnitude (a factor of two)^[1].
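The three approaches to the count example might look something like this in C++. This is only a sketch: the Node type, the ListError codes and the function names are all invented here, and 'corruption' is taken to mean a list that loops back on itself.

```cpp
#include <cassert>

// Hypothetical list node and error codes, invented for this sketch.
struct Node { int value; Node* next; };
enum class ListError { ok = 0, corrupt = 1 };

// Shared worker. `fast` visits every node while `slow` follows at half
// speed; if they ever meet, the list loops back on itself.
bool walk(const Node* head, int& count) {
    count = 0;
    const Node* slow = head;
    const Node* fast = head;
    while (fast != nullptr) {
        ++count;
        fast = fast->next;
        if (count % 2 == 0) slow = slow->next;
        if (fast != nullptr && fast == slow) return false;  // corrupt
    }
    return true;
}

// 1. Compound return type: the value and the error travel together.
struct CountResult { int count; ListError error; };
CountResult countCompound(const Node* head) {
    int n = 0;
    if (!walk(head, n)) return CountResult{0, ListError::corrupt};
    return CountResult{n, ListError::ok};
}

// 2. Error code returned; the count comes back through a reference parameter.
ListError countOutParam(const Node* head, int& count) {
    return walk(head, count) ? ListError::ok : ListError::corrupt;
}

// 3. Reserved range: a negative result is an error code, not a count.
int countReserved(const Node* head) {
    int n = 0;
    return walk(head, n) ? n : -static_cast<int>(ListError::corrupt);
}
```

Notice that only the first two signatures tell the caller an error is possible; with the third you have to read the documentation to learn that -1 is not a count.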

Error status variables

This approach attempts to manage the contention between a function's return value and its error status report. Rather than return a reason code the function sets a shared, global, error variable. After calling the function you must then inspect this status variable to find out whether it completed successfully or not.

The shared variable reduces confusion and clutter in the function's signature, and doesn't restrict the return value's data range at all. However, errors signaled through a separate channel are much easier to miss or to wilfully ignore. A shared global variable also has some nasty thread safety implications.

The C standard library employs this technique with its errno variable - it's a good example of why error status variables are a bad idea. Its use is fraught with peril, nowhere near as simple as a function return value. You must clearly understand the semantics of its operation: before using any standard library facility you have to manually clear errno, because the library functions never set errno to a 'succeeded' value. This is a common source of bugs, and makes calling each library function more tedious. To add insult to injury, not all C standard library functions use errno, so it is less than consistent.

A shared error variable is functionally equivalent to the previous reporting mechanism, but has enough disadvantages to make you avoid it. Don't write your own error reporting mechanism this way, and use existing implementations with the utmost care.
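To see the errno dance in practice, here is a small sketch built around the standard strtol function. The parseLong wrapper is a made-up name for this example; the behaviour of strtol and ERANGE is standard.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdlib>

// Hypothetical wrapper: returns true on success, with the value in `out`.
bool parseLong(const char* text, long& out) {
    char* end = nullptr;
    errno = 0;  // clear it yourself; the library will never do it for you
    const long value = std::strtol(text, &end, 10);
    if (errno == ERANGE) return false;  // out of range: reported via errno
    // No digits, or trailing junk: the C standard does not require errno
    // to be set for these, so they need a separate check.
    if (end == text || *end != '\0') return false;
    out = value;
    return true;
}
```

One function call needs three separate checks through two different channels, which is exactly the kind of inconsistency that makes this mechanism easy to get wrong.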

Exceptions

Exceptions are a structured language-level facility to manage errors. They help to distinguish the 'normal' flow of execution from 'exceptional' cases. When your code encounters a problem it can't handle at that point, it stops dead and throws an error message up in the air. The language runtime then automatically steps back up the call stack until it reaches some handler code. The error message lands there, and the program gets a chance to handle the problem.

There are two operational models:

the termination model (provided by C++ and Java), where execution carries on after the handler code that caught the exception, and
the resumption model, where execution carries on from where the exception was raised.

The former model is easier to reason about, but it doesn't give ultimate control. It only allows fault handling (you can execute code when you notice an error), not fault rectification (a chance to remove the cause of the problem and try again).

An exception cannot be ignored. If it isn't caught and handled, it will propagate to the very top of the call stack and stop the program. The language runtime automatically cleans up as it unwinds the call stack. This makes exceptions a tidier and safer alternative to handcrafted error handling code. However, throwing exceptions through sloppy code can lead to memory leaks and problems with resource cleanup^[2]. You must take care to write exception-safe code. The sidebar explains what this means in more detail.

Code that handles an exception is distinct from the code that raises it, and may be arbitrarily far away. Exceptions are usually implemented in OO languages, where error messages can be defined by a hierarchy of exception classes. A handler can elect to catch a quite specific class of error (by accepting a leaf class), or a more general category of error (by accepting a base class). Exceptions can be the only error reporting mechanism available in certain situations - how else can you signal an error in a constructor?

Exceptions don't come for free; the language support incurs a performance penalty. In reality this isn't very significant, and is only ever seen in the presence of the exception handling statements. Exception handlers reduce potential for optimisation opportunities. Throwing an exception is an expensive operation, so they should only be used for genuinely exceptional events.
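The following sketch pulls these ideas together in C++: a small exception hierarchy, a constructor that reports failure by throwing, and handlers for a leaf class and its base class. All of the class and function names are hypothetical, chosen only for illustration.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical hierarchy: a general base class and one specific leaf.
class StorageError : public std::runtime_error {
public:
    explicit StorageError(const std::string& what) : std::runtime_error(what) {}
};
class DiskFullError : public StorageError {
public:
    DiskFullError() : StorageError("disk full") {}
};

class Document {
public:
    // A constructor has no return value, so throwing is how it fails.
    explicit Document(const std::string& name) : name_(name) {
        if (name_.empty()) throw StorageError("a document needs a name");
    }
    void save(long bytesFree) const {
        if (bytesFree < 1024) throw DiskFullError();
    }
private:
    std::string name_;
};

// The handlers sit apart from the throw sites. The leaf class is listed
// first, then the base class catches every other storage problem.
std::string saveDocument(const std::string& name, long bytesFree) {
    try {
        Document doc(name);
        doc.save(bytesFree);
        return "saved";
    } catch (const DiskFullError&) {
        // Termination model: we carry on from here, not from the throw.
        return "free some space and retry";
    } catch (const StorageError& e) {
        return std::string("cannot save: ") + e.what();
    }
}
```

Anything that isn't a StorageError sails straight past both handlers and continues up the call stack.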

Signals

Signals are a more 'extreme' reporting mechanism - largely used for errors signaled by the execution environment to the running program. The operating system traps a number of exceptional events, like a floating point exception triggered by the maths coprocessor. These well-defined error events are delivered to the application as a 'signal'. A signal interrupts the program's normal flow of execution and jumps into a nominated signal handler function. Your program can receive a signal at any time, and the code must be able to cope with this. When the signal handler completes, program execution continues at whatever point it was interrupted.

Signals are almost the software equivalent of a hardware interrupt. They are a Unix concept, now provided on pretty much every platform (a basic version is a part of the ISO C standard). The operating system provides sensible default handlers for each signal (some of which do nothing, others abort the program with a neat error message). You can override these with your own handler.

The defined signal events include: program termination, execution suspend/continue requests, and maths errors. Some environments extend the basic list with many more events.
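A minimal sketch of the ISO C signal facilities, as seen from C++, might look like this. The handler and flag names are invented for the example; std::signal, std::raise and SIGTERM are standard.

```cpp
#include <cassert>
#include <csignal>

// Set by the handler, polled by normal code. A volatile sig_atomic_t is
// a type that a handler can safely write to.
volatile std::sig_atomic_t g_stopRequested = 0;

extern "C" void onTerminate(int /*signalNumber*/) {
    g_stopRequested = 1;  // do as little as possible inside a handler
}

// Installs our handler, sends this process the signal, then puts the
// previous handler back. Returns true if our handler ran.
bool demonstrateSignal() {
    g_stopRequested = 0;
    auto previous = std::signal(SIGTERM, onTerminate);
    if (previous == SIG_ERR) return false;
    std::raise(SIGTERM);             // returns once the handler has run
    std::signal(SIGTERM, previous);  // restore the earlier behaviour
    return g_stopRequested == 1;
}
```

In a real program the signal would arrive from outside, at any moment, which is why the handler does nothing more than set a flag for the main code to notice later.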

Whistle-stop tour of exception safety

In languages without automatic cleanup facilities, resilient code must be exception-safe. It must work correctly no matter what exceptions come its way, for some definition of 'correctly' (we'll define this below). It doesn't matter whether the code handles any exceptions or not.

Exception neutral code propagates all exceptions up to the caller; it won't consume or change anything. This is an important concept for 'generic' programs, like C++ template code - the template types may generate all sorts of exceptions that template implementors don't understand.

There are several different levels of exception 'safety'. They are described in terms of guarantees to the calling code. These guarantees are:

Basic guarantee: If exceptions do occur in a function (resulting from an operation we perform, or the call of another function) we will not leak resources. The code state will be consistent (i.e. it can still be used correctly), but will not necessarily be left in a known state. For example, a member function should add ten items to a container, but an exception propagates through it. We guarantee the container is still usable, but maybe no objects were inserted, maybe all ten were, or perhaps every other object was added.
Strong guarantee: This is far stricter than the basic guarantee. Here we ensure that if an exception propagates through our code the program state will remain completely unchanged. The object hasn't been altered, no global variables changed, nothing. In our example above, we can assert that no objects will have been inserted into the container at all.
Nothrow guarantee: The final guarantee is the most restrictive. We guarantee that an operation can never ever throw an exception. If we are 'exception neutral' then this implies that the function cannot call any function that itself might throw.

Which of the guarantees you provide is entirely your choice. The more restrictive the guarantee, the more widely (re-)usable the code is. In order to implement the strong guarantee you will generally require the use of a number of functions that provide the nothrow guarantee.

Most notably, every destructor you write should always honour the nothrow guarantee. Always. Otherwise all exception-handling bets are off. In the presence of an exception, object destructors will be called automatically as the stack is unwound. If one of those destructors raises a second exception while the first is still unwinding the stack, C++ gives up and terminates the program.
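The difference between the basic and strong guarantees can be sketched with the container example. This is one common way to get the strong guarantee (work on a copy, then swap), not the only one; the function names and the failing checked operation are assumptions made for the example.

```cpp
#include <cassert>
#include <stdexcept>
#include <vector>

// Hypothetical per-item operation that may fail part-way through a batch.
int checked(int item) {
    if (item < 0) throw std::invalid_argument("negative item");
    return item;
}

// Basic guarantee: if checked() throws, `target` is still a valid vector,
// but some of the items may already have been appended.
void appendBasic(std::vector<int>& target, const std::vector<int>& items) {
    for (int item : items) target.push_back(checked(item));
}

// Strong guarantee: do all the risky work on a copy, then commit with
// swap(), which does not throw. On failure `target` is untouched.
void appendStrong(std::vector<int>& target, const std::vector<int>& items) {
    std::vector<int> scratch(target);
    for (int item : items) scratch.push_back(checked(item));
    target.swap(scratch);
}
```

The strong version pays for its promise with an extra copy of the container, so the stricter guarantee is not free.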

Each of these mechanisms has different implications for the locality of error. An error is local in time if it is discovered very soon after it is created. An error is local in space if it is identified very close to (or even at) the site where the error actually manifests. Some approaches specifically aim to reduce the locality of error to make it easier to see what's going on (e.g., error codes). Others (like exceptions) aim to extend the locality of error so code doesn't get entwined with error handling logic.

The favoured type of reporting mechanism may be an architectural decision. It might be considered important to define a homogeneous hierarchy of exception classes, or a central list of shared reason codes.

Detecting errors

How you detect an error obviously depends on the mechanism reporting it. In practical terms, this means:

Return values

You determine whether a function failed by looking at its return code. This failure test is bound tightly to the act of calling the function; by making the call you are implicitly checking its success. Whether you do anything with that information is up to you. [Though I am working on a proposal for C++ that would allow the author of a function to specify that the return type could not be ignored. - JAD]
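For what it's worth, C++17 provides a [[nodiscard]] attribute that lets a function's author ask the compiler to warn when a return value is thrown away. A minimal sketch, assuming a C++17 compiler and with invented function names:

```cpp
#include <cassert>

// [[nodiscard]] asks the compiler to warn if a caller discards the result.
[[nodiscard]] bool withdraw(int& balance, int amount) {
    if (amount < 0 || amount > balance) return false;
    balance -= amount;
    return true;
}

// Returns the remaining balance, or -1 if the withdrawal failed.
int payBill(int balance) {
    // withdraw(balance, 50);  // compilers typically warn: result ignored
    if (!withdraw(balance, 50)) {
        return -1;
    }
    return balance;
}
```

It is only a warning, and a determined programmer can still cast the result to void, but it does make ignoring the error a deliberate act.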

Error status variables

After calling a function which sets an error status variable, you must inspect this variable. If it follows C's errno model of operation you needn't actually test for error after every single function call. Reset errno first, then call any number of standard library functions back-to-back. Afterwards, inspect errno. If it contains an error value, then one of those functions failed. Of course, you don't know which one fell over, but if you don't care about that level of detail, this is a slightly streamlined error detection approach.

Exceptions

If an exception propagates out of a subordinate function, you can choose to catch and handle it, or to ignore it and let the exception flow up a level. You can only make an informed choice when you know what kinds of exception might be thrown. You'll only know this if it's been documented (and if you trust the documentation).

Java's exception implementation places this documentation in the code itself. The programmer has to write an exception specification for every method, describing which checked exceptions it can throw; it is a part of the function's signature. Java is the only mainstream language to enforce this approach. You cannot leak a checked exception that isn't in the list; the compiler performs static checking to prevent this from happening^[3].

Signals

There's only one way to detect a signal: install a handler for it. There's no obligation. You can choose not to install any signal handlers at all, and accept the default behaviour.

As various bits of code converge in a large system, you will probably need to detect errors in more than one way, even within a single function.

Whichever detection mechanism you use, the key point is this:

Never ignore any errors that might be reported to you.

If an error report channel exists, it's there for a reason.

When you let an exception propagate through your code you are not ignoring it - you can't ignore an exception. You are allowing it to be handled by a higher level. The philosophy of exception handling is quite different in this respect.

Even if you think that an error has no implication for the rest of your code, it is a good practice to write the detection scaffolding anyway, and to not take any action in the handler. This makes it clear to a maintenance programmer that you are fully aware how the function may fail, and you have consciously chosen to ignore any failures.
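Such scaffolding might look like the sketch below. The discardScratchFile function is hypothetical; std::remove is the standard library call, which returns non-zero on failure.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical helper that tidies away a temporary file.
void discardScratchFile(const char* path) {
    if (std::remove(path) != 0) {
        // Deliberately no action. The file may never have been created,
        // and a stray scratch file does no harm. The test is written out
        // so that a maintainer can see the failure was considered.
    }
}
```

The empty handler costs nothing at run time, but it documents a decision rather than an oversight.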

Handling errors

Love truth, and pardon error. (Voltaire)

Errors happen. We've seen how to discover them, and when to do so. The question now is: what do you do about them? This is the hard part. The answer depends largely on circumstance and the gravity of an error - whether it's possible to rectify the problem and retry the operation, or to carry on regardless. Often there is no such luxury; the error may even herald the 'beginning of the end'. The best you can do is clean up and exit sharply, before anything else goes wrong.

To make this kind of decision you must be informed. You need to know a few key pieces of information about the error:

Where

it came from (which is quite distinct from where it's going to be handled). Is the source a core system component, or a peripheral module? This information may be encoded in the error report, or else we know what function was called and can figure it out manually.

What

you were trying to do. What provoked it? This may give a clue toward any remedial action. We probably only know this from our understanding of the error's context - you know what function was called. Error reporting seldom contains this kind of information.

Why

it went wrong, and the nature of the problem. This only makes sense in the context of the error's source, and what was being done. You need to know exactly what has happened, not just a general hand-wavy class of error. It's important to know how much of the erroneous operation completed - all or none are nice answers, but generally the program will be in some indeterminate state between the two.

When

it happened. This is the locality of the error in time. Has the system just failed, or has a two-hour old problem only just been spotted?

The severity

of the error. Some problems are more serious than others, but when detected one error is equivalent to any other - we can't continue without understanding and managing the problem.

The level of severity is usually the caller's opinion, based on how easy it will be to recover or work around the error. If it's not a big deal, the strategy may just be to live with the problem. If it affects core functionality this isn't acceptable; the code must do everything possible to fix the problem and continue as if nothing happened.

How

to fix it. This may be obvious (e.g., insert a floppy disk and retry), or may not (e.g., you need to modify the function parameters so they are consistent). More often than not we have to infer this knowledge from the other information collated.

Given this depth of information you can formulate a strategy to handle each error. Forgetting to insert a handler for any potential error will lead to a bug - it might be a hard-to-exercise bug, and hard to track down - so think about every error condition carefully.
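One way to carry this information around is to bundle it into a single error report. The sketch below is an assumption about how that might look, not a prescribed design; the ErrorReport structure, its field names and the describe function are all invented for illustration.

```cpp
#include <cassert>
#include <ctime>
#include <string>

enum class Severity { warning, error, fatal };

// Hypothetical report carrying the key facts about one error.
struct ErrorReport {
    std::string where;   // component that raised the error
    std::string what;    // operation that was being attempted
    std::string why;     // precise nature of the failure
    std::time_t when;    // when it happened, not when it was noticed
    Severity severity;   // a hint only; the caller usually decides
};

// Formats a report as a single line of text.
std::string describe(const ErrorReport& r) {
    const char* level = r.severity == Severity::warning ? "warning"
                      : r.severity == Severity::error   ? "error"
                                                        : "fatal";
    return std::string("[") + level + "] " + r.where + ": " +
           r.what + " failed: " + r.why;
}
```

There is no 'how to fix it' field; as noted above, that usually has to be inferred from everything else.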

When to deal with errors

So when should you handle each error? This can be separate from when it's detected. There are two schools of thought.

As soon as possible

Handle each error as you detect it. Since the error is handled near to its cause you retain important contextual information, making the error handling code clearer. This is a well-known self-documenting code technique. Managing each error near its source means there's less code which control passes through in an 'invalid' state; too much of that leads to very dense logic.

This is usually the best option for functions that return error codes.

As late as possible

Alternatively, defer error handling as long as possible. This recognises that code detecting an error rarely knows what to do about it. Often it depends on the context it is being used in: a missing file error may be reported to the user when opening a document, but silently handled when hunting for a preferences file.

Exceptions are ideal for this approach; you can let an exception propagate until it reaches code that knows how to deal with the error. This separation of detection and handling may be clearer, but can make code more complex. It's not obvious that you are deliberately deferring error handling, nor is it clear where an error came from by the time you do handle it.
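The missing file example can be sketched as follows. To keep it self-contained the 'disk' is just a set of file names, and every name here (FileMissing, readFile, loadPreferences, openDocument) is hypothetical.

```cpp
#include <cassert>
#include <set>
#include <stdexcept>
#include <string>

class FileMissing : public std::runtime_error {
public:
    explicit FileMissing(const std::string& path)
        : std::runtime_error("missing file: " + path) {}
};

// Low-level code detects the problem but has no idea what it means.
// The set of names is a stand-in for a real filesystem.
std::string readFile(const std::set<std::string>& disk, const std::string& path) {
    if (disk.count(path) == 0) throw FileMissing(path);
    return "contents of " + path;
}

// For preferences a missing file is unremarkable: quietly use defaults.
std::string loadPreferences(const std::set<std::string>& disk) {
    try {
        return readFile(disk, "prefs.ini");
    } catch (const FileMissing&) {
        return "default preferences";
    }
}

// The user asked for this document by name, so they need to be told.
std::string openDocument(const std::set<std::string>& disk, const std::string& path) {
    try {
        return readFile(disk, path);
    } catch (const FileMissing& e) {
        return std::string("Error: ") + e.what();
    }
}
```

The same error from the same function gets two quite different treatments, and readFile never needs to know about either.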

In theory, it's nice to separate 'business logic' from error handling. Often you can't, as cleanup is necessarily entwined with that business logic. It can be more tortuous to write the two separately. However, centralising error handling code has advantages: you know where to look for it, and can put the abort/continue policy in one place, rather than scattered through many functions.

Thomas Jefferson opined "delay is preferable to error". There is truth there; the actual existence of error handling is far more important than when an error is handled. Nevertheless, choose a compromise that's close enough to prevent obscure and out of context error handling, whilst being far enough away to not cloud 'normal' code with labyrinthine paths and error handling dead ends.

Handle each error in the most appropriate context, as soon as you know enough to handle it correctly.

This is usually the context that created the error.

Possible reactions

You've caught an error. You're poised to handle it. What are you going to do now? Hopefully, whatever is required for correct program operation. Whilst we can't possibly list every recovery technique under the sun, here are the common reactions to consider.

Logging

Any reasonably large project should already be employing a logging facility. It allows you to collect important trace information, and is an entry point for the investigation of nasty problems.

The log exists to record interesting events in the life of the program, to allow you to delve into the inner workings and reconstruct paths of execution. For this reason all errors you encounter should be detailed in the program log; they are one of the most interesting and telling events of all. Aim to capture all pertinent information - as much of the list above as you can.

For really obscure errors that predict catastrophic disaster, it may be a good idea to get the program to 'phone home' - to transmit either a snapshot of itself, or a copy of the error log, to the developers for further investigation.

What you do after logging is another matter...

Reporting

A program should only report an error to its user when it doesn't know what else to do. The user does not need to be bombarded by a thousand small nuggets of useless information, or be badgered by a raft of pointless questions. Save the interaction for when it really is vital. Don't report when you encounter a recoverable situation. Log the event by all means, but keep quiet about it. Provide a mechanism for the user to read the event log if you think they might care one day.

There are some problems that only the user can fix. For these it is good practice to report the problem immediately, to give the best chance to resolve the situation, or to decide how to continue.

Of course, this kind of reporting depends on whether the program is interactive or not. Deeply embedded systems are expected to cope on their own.

Recovery

Sometimes your only course of action is to stop immediately. But not every error spells doom. Some are quite expected. If your program saves a file, one day the disk will fill up and the save operation will fail. The user expects your program to continue faultlessly under these situations, so be prepared.

If your code encounters an error and doesn't know what to do about it, pass the error upwards. It's more than likely your caller will have the ability to recover.

Ignore

I only include this for completeness. Hopefully by now you've learnt to scorn the very suggestion of ignoring an error. If you choose to forget all about handling it, and to continue with your fingers crossed: good luck. This is where most of the bugs in any software package will come from. Ignoring an error whose occurrence may cause the system to misbehave inevitably leads to hours spent debugging. Ignoring errors does not save time.

You'll end up spending far longer working out the cause of bad program behaviour than you ever would have spent writing the error handler.

You can, however, write code that allows you to do nothing when an error crops up. Is that a blatant contradiction of what you just read? No. It is possible to write code that copes with the world not being right, that can carry on correctly in the face of an error, but it often gets quite convoluted. If you adopt this approach, you must make it clear in the code. Don't risk it being misinterpreted as ignorant and incorrect.

Propagate

When a subordinate function call fails you probably can't carry on, but don't know what else to do. The only option is to clean up, and propagate the error report upwards. You have options. When propagating an error you can either

export the same error information you were fed, or
reinterpret the information, sending a more meaningful message to the next level up.

Ask yourself this question: does the error relate to a concept exposed (directly, or indirectly) through the module interface? If so, it's OK to propagate that same error. Otherwise, recast it in the appropriate light, choosing an error report that makes sense in the context of your module's interface.

This is a good self-documenting code technique. For example, you can catch and wrap up exceptions, or return a different reason code to the one you received.
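Here is a small sketch of recasting an error at a module boundary. The UserDirectory class and UnknownUser exception are invented for the example; std::map::at and std::out_of_range are standard.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>

// The error this module's interface promises to report.
class UnknownUser : public std::runtime_error {
public:
    explicit UnknownUser(const std::string& name)
        : std::runtime_error("unknown user: " + name) {}
};

class UserDirectory {
public:
    void add(const std::string& name, const std::string& email) {
        emails_[name] = email;
    }
    // The map is an implementation detail, so the std::out_of_range that
    // at() throws is recast in terms the caller understands.
    std::string emailFor(const std::string& name) const {
        try {
            return emails_.at(name);
        } catch (const std::out_of_range&) {
            throw UnknownUser(name);
        }
    }
private:
    std::map<std::string, std::string> emails_;
};
```

If the directory later switches from a map to a database, its callers' error handling need not change.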

Crafting error messages

Inevitably your code will encounter an error that the user has to sort out. Human intervention may be the only option; your code can't insert a floppy disk by itself or switch on the printer. (If it can, you'll make a fortune!)

If you're going to whinge at the user, there are a few general points to bear in mind.

Users don't think like programmers, so present information the way they'd expect. When displaying the free space on a disk you might print this: Disk space: 10K. If there's no space left, the resulting 0K could be misread as 'OK' - the user will not be able to fathom why they can't save a file when the program says everything's fine.
Make sure your messages aren't too cryptic. You might understand them. Can your granny? (It doesn't matter if your granny won't use this program, it will almost certainly be driven by someone with a lower intellect than her.)
Don't present meaningless error codes (unless as some 'additional info' to send to the developers). No user knows what to do when faced with an Error code 707E.
Make it clear what's an error and what's a mere warning. You could include this in the message text (perhaps with an Error: prefix), and can emphasise it in message boxes with an accompanying icon.
Only ask a question (even a simple one like Continue: Yes/No?) if the user fully understands the ramifications of each choice. Explain it if necessary.
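As a tiny sketch of the first and fourth points, the hypothetical function below never prints a bare zero, and marks the error case explicitly. The wording of the messages is only an example.

```cpp
#include <cassert>
#include <string>

// Hypothetical formatter for a free disk space display.
std::string describeFreeSpace(long kilobytesFree) {
    if (kilobytesFree <= 0) {
        // Say what is wrong and what to do, rather than print '0K'.
        return "Error: the disk is full. Free some space, then save again.";
    }
    return "Disk space: " + std::to_string(kilobytesFree) + "K free";
}
```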

What you present to the user will be determined by interface constraints, and application or OS style guides. If your company has a user interface engineer, then it's their job to make these decisions. Work with them.

Code implications

Show me the code! Let's spend some time investigating the implications of error handling in our code. As we'll see, writing good error handling that doesn't twist and warp the underlying program logic is not a simple task.

Starting off in the world of C, the first piece of code we'll look at is a common error handling structure. Yet it's not a particularly intelligent approach for writing error-tolerant code. Our basic aim is to call three functions sequentially - each of which may fail - performing some intermediate calculations along the way. Spot the problems with this:

void nastyErrorHandling() {
  if (operationOne()) {
    ... do something ...
    if (operationTwo()) {
      ... do something else ...
      if (operationThree()) {
        ... do more ...
      }
    }
  }
}

Syntactically it's fine; the code will work. Practically, it's an unpleasant style to maintain. The more operations you need to perform, the more deeply nested the code gets, the harder it is to read. This kind of error handling quickly leads to a rat's nest of conditional statements. It doesn't reflect the actions of the code well; each intermediate calculation could be considered the same level of importance, yet they are nested at different levels.

Can we avoid these problems? Yes - there are a couple of alternatives. The first variant flattens the nesting. It's semantically equivalent, but introduces some new complexity, since flow control is now dependent on the value of a new 'status' variable, ok:

void flattenedErrorHandling() {
  bool ok = operationOne();
  if (ok) {
    ... do something ...
    ok = operationTwo();
  }
  if (ok) {
    ... do something else ...
    ok = operationThree();
  }
  if (ok) {
    ... do more ...
  }
  if (!ok) {
    ... clean up after errors ...
  }
}

We've also added an opportunity to clean up after any errors. Is that sufficient to mop up all failures? Probably not; the necessary cleanup may depend on how far we got through the function before lightning struck. There are two C-style cleanup approaches:

Perform a little cleanup after each operation that may fail, then return early. This inevitably leads to duplication of cleanup code. The more work you've done, the more you have to clean up, so each exit point will need to do gradually more unpicking. If each operation in our example allocates some memory, each early exit point will have to release all allocations made to date. The further in, the more releases. That will lead to some quite dense and repetitive error handling code, and makes the function far larger and far harder to understand.
Write the cleanup code once, at the end of the function, but write it in such a way as to only clean up what's dirty. This is neater, but if you inadvertently insert an early return in the middle of the function, the cleanup code will be bypassed.
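
As a rough illustration of the second approach, here is a minimal sketch. It is C-style code written so that it compiles as C++. The names allocateBuffer and singleCleanupPoint, the buffer sizes, and the simulated failure flag are all invented for this example. The trick is to initialise every pointer to null, so the single cleanup section can release everything without caring how far the function got.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>

// Hypothetical stand-in for an operation that acquires a resource.
// It can be told to fail so that the error path is easy to exercise.
static bool allocateBuffer(char **out, std::size_t size, bool simulateFailure) {
  assert(out != nullptr);  // a null 'out' would be a bug in the caller
  *out = simulateFailure ? nullptr : static_cast<char *>(std::malloc(size));
  return *out != nullptr;
}

// Returns true on success. All cleanup lives in one place, at the end.
bool singleCleanupPoint(bool failSecond) {
  char *first = nullptr;   // null means "nothing to clean up yet"
  char *second = nullptr;

  bool ok = allocateBuffer(&first, 64, false);
  if (ok) {
    ok = allocateBuffer(&second, 64, failSecond);
  }
  if (ok) {
    std::snprintf(second, 64, "%s", "work done");  // ... do more ...
  }

  // Clean up only what is dirty: free() on a null pointer does nothing,
  // so this is safe however far we got before the failure.
  std::free(second);
  std::free(first);
  return ok;
}
```

Calling singleCleanupPoint(true) exercises the failure path: the first buffer is still released, and the function reports false. The weakness described above remains, though. An early return added in the middle would skip both calls to free().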

If you're not anal about writing Single Entry, Single Exit (SESE) functions, this next example removes the reliance on a separate control flow variable^[4]. We do lose the clean up code again, though. Simplicity renders this a better description of the actual intent:

void shortCircuitErrorHandling() {
  if (!operationOne()) return;
  ... do something ...
  if (!operationTwo()) return;
  ... do something else ...
  if (!operationThree()) return;
  ... do more ...
}

A marriage of this 'short circuit' exit with the requirement for cleanup leads to the following approach, especially seen in low level C systems code. Some people advocate it as the only valid use for the maligned goto. I'm still not convinced.

void gotoHell() {
  if (!operationOne()) goto error;
  ... do something ...
  if (!operationTwo()) goto error;
  ... do something else ...
  if (!operationThree()) goto error;
  ... do more ...
  return;
error:
  ... clean up after errors ...
}

In C++ you can avoid such monstrous code using RAII (Resource Acquisition Is Initialisation) techniques, like smart pointers [Stroustrup97]. This has the added bonus of providing exception safety - when an exception terminates your function prematurely, resources are deallocated automatically. These techniques avoid a lot of the problems we've seen above, moving complexity to a separate flow of control.

The same example using exceptions would look like this, presuming that the subordinate functions do not return a status value, but throw an exception on failure.

void exceptionalHandling() {
  try {
    operationOne();
    ... do something ...
    operationTwo();
    ... do something else ...
    operationThree();
    ... do more ...
  }
  catch (...) {
    ... clean up after errors ...
    ... and probably rethrow ...
  }
}

This is only the most basic example. A sound code design wouldn't need the try/catch block at all. Writing good code in the face of exceptions requires an understanding of principles beyond the scope of this article.
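
To give a flavour of what that might look like, here is a minimal RAII sketch, assuming C++14 or later for std::make_unique. The Resource type, its instance counter, and operationThatMayThrow are hypothetical, invented purely so the cleanup can be observed. They are not part of the original example.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>

// Hypothetical resource that counts live instances, so that we can
// observe whether cleanup really happened.
struct Resource {
  static int live;
  Resource() { ++live; }
  ~Resource() {
    --live;
    assert(live >= 0);  // invariant: never release more than was acquired
  }
};
int Resource::live = 0;

// Hypothetical stand-in for an operation that reports failure by throwing.
void operationThatMayThrow(bool fail) {
  if (fail) throw std::runtime_error("operation failed");
}

// No try/catch and no explicit cleanup code: the smart pointers own the
// resources and release them however the function is left.
void raiiHandling(bool failSecond) {
  auto first = std::make_unique<Resource>();
  operationThatMayThrow(false);
  // ... do something ...
  auto second = std::make_unique<Resource>();
  operationThatMayThrow(failSecond);  // may leave the function right here
  // ... do more ...
}  // both resources are released here, on normal exit or during unwinding
```

If the caller catches the exception thrown by raiiHandling(true), Resource::live is back to zero by the time the handler runs. The error still propagates, but nothing leaks on the way out.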

Raising hell

We've put up with other people's errors for long enough. It's time to turn the tables and play the bad guy. It's pitifully clear that when writing a function, erroneous things happen that you'll need to signal to your caller. Make sure you do - don't swallow any failure silently. Even if you're sure that the caller won't know what to do in the face of the problem, they must be kept informed. Don't write code that lies, and pretends to be doing something it's not.

Which reporting mechanism should you use? It's largely an architectural choice; obey the project conventions, and the common language idioms. In C++ and Java it is common to favour exceptions, but only use them if the rest of the project does. A C++ architecture may choose to forgo this facility to allow portability to platforms with no exception support.

One aspect of error raising is the propagation of errors from subordinate function calls. We've seen strategies for this already. Our main concern here is reporting fresh problems encountered during execution. How you determine these errors is your own business, but when reporting them consider:

Have you cleaned up appropriately first? Reliable code doesn't leak resource, or leave the world in an inconsistent state, even when an error occurs - unless it's really unavoidable. If you do, it should be documented carefully. Consider what will happen after this error has manifested; when your code is next called, ensure it will still work.
Make sure your error doesn't leak inappropriate information to the outside world. Only return useful information that the caller understands and can act on.
Use exceptions correctly. Don't throw an exception for 'unusual' return values - the rare but not erroneous cases. Use exceptions only to signal exceptional execution circumstances. Don't use them for flow control; that is an abuse^[5].
Consider using assertions if you're trapping an error that should 'never' happen in the normal course of program execution, a genuine programming error.
If you can push any tests to compile time, then do so. The sooner you detect and rectify an error, the less hassle it can cause. Compile-time assertions can be used in both C and C++.
Make it hard for people to ignore your errors. Given half a chance someone will use your code badly. Exceptions are good for this - you have to be quite deliberate to hide an exception.
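
The two assertion points in that list can be sketched together. This example assumes C++11 or later, where static_assert is a language keyword. The average function and its contract are hypothetical, chosen only to show where each kind of check sits.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>

// Compile-time check: the build fails at once if the assumption is wrong,
// long before any user can trip over it.
static_assert(CHAR_BIT == 8, "this code assumes 8-bit bytes");

// Hypothetical helper returning the mean of a non-empty array.
// Under this (assumed) contract a null pointer or a zero count is a
// programming error, so it is trapped with assertions rather than
// reported to the caller as a run-time failure.
double average(const double *values, std::size_t count) {
  assert(values != nullptr);
  assert(count > 0);

  double sum = 0.0;
  for (std::size_t i = 0; i < count; ++i) {
    sum += values[i];
  }
  return sum / static_cast<double>(count);
}
```

Remember that assert compiles away when NDEBUG is defined, so it is only suitable for the 'never happens' cases. Anything a user or external system could provoke still needs a real run-time check.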

What errors should we be looking out for? This obviously depends on what the function's doing. Here's a checklist for the general kinds of error checking you should be doing in each function:

Check all function parameters. Ensure you have been given correct and consistent input. Consider using assertions for this, depending on how strictly your contract was written (i.e. if it is an 'offence' to supply bad parameters).
Check invariants are satisfied at interesting points in execution.
Check all values from external sources for validity before you use them. File reading and the user's input values should be sensible, with no bits missing.
Check the return status of all system calls and other subordinate function calls.
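
Here is a small sketch that touches most of the checklist in one function. The readPercentage function, the ReadStatus codes, and the choice of 0 to 100 as the 'sensible' range are all assumptions made up for illustration.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical status codes for this example only.
enum class ReadStatus { Ok, CannotOpen, BadData };

// Hypothetical function: reads a percentage (0 to 100) from a text file.
// 'result' is only written to on success.
ReadStatus readPercentage(const char *path, int *result) {
  // Parameters: under this contract null pointers are caller bugs.
  assert(path != nullptr);
  assert(result != nullptr);

  // Subordinate call: fopen can fail, so check its return value.
  std::FILE *file = std::fopen(path, "r");
  if (file == nullptr) return ReadStatus::CannotOpen;

  int value = 0;
  const int fieldsRead = std::fscanf(file, "%d", &value);
  std::fclose(file);  // read-only stream; its status is deliberately ignored

  // External data: nothing missing, and the value must be sensible.
  if (fieldsRead != 1) return ReadStatus::BadData;
  if (value < 0 || value > 100) return ReadStatus::BadData;

  *result = value;
  return ReadStatus::Ok;
}
```

The caller has to look at the returned status to find out whether result is valid, which is about as much as a return code can do to make itself hard to ignore.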

Managing errors

The common principle uniting the raising and handling of errors is to have a consistent strategy for dealing with failure, wherever it manifests. These are considerations for managing the occurrence, detection and handling of program errors:

Try to avoid things that could cause an error. Can you do something guaranteed to work instead? For example, avoid allocation errors by reserving enough resource beforehand. With an assured pool of memory your routine cannot suffer memory restrictions. Naturally, this will only work when you know how much resource you need up front; many times you do.
Define the program or routine's expected behaviour under abnormal circumstances. This determines how robust the code needs to be, and therefore how thorough your error handling should be. Can a function silently generate bad output, subscribing to the historic GIGO^[6] principle?
Define clearly which components are responsible for handling which errors. Make it explicit in the module's interface. Ensure a caller knows what will always work and what may one day fail.
Check your programming practice: when do you write error handling code? Don't wait until your development testing highlights problems before writing handlers - that's not an engineering approach. Write all error detection and handling now, as you write the code that may fail; put it off until later and you'll forget to handle something. If you must be evil and defer handling, always write the detection scaffolding now.
When trapping an error, have you found a symptom or a cause? Consider whether you've discovered the source of a problem which needs rectifying here, or have discovered a symptom of an earlier problem. If it's the latter then don't write reams of handling code here; put that in a more appropriate error handler.
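
The first point in that list, reserving resource up front, might look like this with a standard container. The collectSquares function is a hypothetical example. The idea is to have a single step that can fail for lack of memory, placed before any real work starts.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical routine: collects the squares of 0 .. count-1.
std::vector<int> collectSquares(std::size_t count) {
  std::vector<int> squares;

  // The one step that can fail for lack of memory (by throwing
  // std::bad_alloc). If it succeeds, the loop below cannot run out.
  squares.reserve(count);
  assert(squares.capacity() >= count);

  for (std::size_t i = 0; i < count; ++i) {
    // No reallocation happens while size() stays within capacity().
    squares.push_back(static_cast<int>(i * i));
  }
  return squares;
}
```

This only moves the possible failure to a single, predictable point; it doesn't remove it. That is still a useful trade, since one allocation made before any work is done is far easier to handle than a failure halfway through the loop.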

Conclusion

To err is human; to repent, divine; to persist, devilish. (Benjamin Franklin)

To err is human (but computers seem quite good at it, too). To handle this error is divine.

Every line of code we write should be balanced by appropriate and thorough error-checking and handling. A program without rigorous error handling will not be stable. One day an obscure error may occur, and the program will fall over as a result.

Handling errors and failure cases is hard work. It bogs programming down in the mundane details of the Real World. However, it's absolutely essential. As much as 90% of the code you write will be handling exceptional circumstances [ShawBentley82]. That's quite a surprising statistic, so write code expecting to put far more effort into the things that can go wrong than the things that will go right.

Homework

Here are a couple of questions to mull over, and discuss on accu-general.

How should you handle the occurrence of errors in your error-handling code?
Are return values and exceptions equivalent mechanisms? Prove it.

References

[ShawBentley82] Bentley, Jon Louis. Writing Efficient Programs. Prentice Hall Professional, 1982. ISBN: 013970244X

[Stroustrup97] Stroustrup, Bjarne. The C++ Programming Language, Third Edition. Addison Wesley, 1997. ISBN: 0-201-88954-4

^[1] If you used an unsigned int you'd have a power of two more values available, reusing the signed int's sign bit.

^[2] For example, you could allocate a block of memory, and then exit early as an exception propagates through. The allocated memory would leak. This kind of problem makes writing code in the face of exceptions a more complex business.

^[3] C++ also supports exception specifications, but leaves their use optional. It's idiomatic to avoid them - for performance reasons, among others. Unlike Java, they are enforced at run time.

^[4] Although this clearly isn't SESE, I contend that the previous example isn't, either. There is only one exit point, at the end, but the contrived control flow is simulating early exit - it's as good as multiple exit. This is a good example of how being bound by a rule like SESE can lead to bad code, unless you think carefully about what you're doing.

^[5] I've seen people break a loop or end recursion by throwing exceptions. This uses an exception like a non-local goto. It's a curiosity, but a plain wrong use of exceptions.

^[6] That is, Garbage In Garbage Out - feed it rubbish, and it will happily spit out rubbish.
