Overload Journal #92 - August 2009 + Design of applications and programs

Title: The Generation, Management and Handling of Errors (Part 1)

Author: webeditor

Date: Sun, 09 August 2009 09:55:00 +01:00

Summary: An error handling strategy is important for robustness. Andy Longshaw and Eoin Woods present a pattern language.

Body: 

In recent years there has been a wider recognition that there are many different stakeholders for a software project. Traditionally, most emphasis has been given to the end user community and their needs and requirements. Somewhere further down the list is the business sponsor; and trailing well down the list are the people who are tasked with deploying, managing, maintaining and evolving the system. This is a shame, since unsuccessful deployment or an unmaintainable system will result in ultimate failure just as certainly as if the system did not meet the functional requirements of the users.

One of the key requirements for any group required to maintain a system is the ability to detect errors when they occur and to obtain sufficient information to diagnose and fix the underlying problems from which those errors spring. If incorrect or inappropriate error information is generated from a system it becomes difficult to maintain. Too much error information is just as much of a problem as too little. Although most modern development environments are well provisioned with mechanisms to indicate and log the occurrence of errors (such as exceptions and logging APIs), such tools must be used with consistency and discipline in order to build a maintainable application. Inconsistent error handling can lead to many problems in a system such as duplicated code, overly-complex algorithms, error logs that are too large to be useful, the absence of error logs and confusion over the meaning of errors. The incorrect handling of errors can also spill over to reduce the usability of the system as unhandled errors presented to the end user can cause confusion and will give the system a reputation for being faulty or unreliable. All of these problems are manifest in software systems targeted at a single machine. For distributed systems, these issues are magnified.

This paper sets out a collection (or possibly a language) of patterns that relate to the use of error generating, handling and logging mechanisms - particularly in distributed systems. These patterns are not about the creation of an error handling mechanism such as [Harrison] or a set of language specific idioms such as [Haase], but rather about the application code that makes use of such underlying functionality. The intention is that these patterns combine to provide a landscape in which sensible and consistent decisions can be made about when to raise errors, what types of error to raise, how to approach error handling and when and where to log errors.

Overview

The patterns presented in this paper form a pattern collection to guide error handling in multi-tier distributed information systems. Such systems present a variety of challenges with respect to error handling, including the distribution of elements across nodes, the use of different technology platforms in different tiers, a wide variety of possible error conditions and an end-user community that must be shielded from the technical details of errors that are not related to their use of the system. In this context, a software designer must make some key decisions about how errors are generated, handled and managed in their system. The patterns in this paper are intended to help with these system-wide decisions such as whether to handle domain errors (errors in business logic) and technical errors (platform or programming errors) in different ways. This type of far-reaching design decision needs careful thought and the intent of the patterns is to assist in making such decisions.

As mentioned above, the patterns presented here are not detailed design solutions for an error handling framework, but rather, are a set of design principles that a software designer can use to help to ensure that their error handling approach is coherent and consistent across their system. This approach to pattern definition means that the principles should be applicable to a wide variety of information systems, irrespective of their implementation technology. We are convinced of the applicability of these patterns in their defined domain. You may also find that they are applicable to systems in other domains - if so then please let us know.

The patterns in the collection are illustrated in Figure 1.

Figure 1

The boxes in the diagram each represent a pattern in the collection. The arrows indicate dependencies between the patterns, with the arrow running from a pattern to another pattern that it is dependent upon. For example, to implement Log Unexpected Errors you must first Make Exceptions Exceptional. In turn, the Logging of Unexpected Errors supports a Big Outer Try Block. You can see that to get the most benefit from the set of patterns it is best to use the whole set in concert.

At the end of the paper, a set of proto-patterns is briefly described. These are considered to be important concepts that may or may not become fully fledged patterns as the paper evolves.

Expected vs. unexpected and domain vs. technical errors

This pattern language classifies errors as 'domain' or 'technical' and also as 'expected' or 'unexpected'. These two classifications are largely orthogonal. You can have an expected domain error (no funds in the account), an unexpected domain error (account not in database), an expected technical error (WAN link down - retry), and an unexpected technical error (missing link library). Having said this, the most common combinations are expected domain errors and unexpected technical errors.

A set of domain error conditions should be defined as part of the logical application model. These form your expected domain errors. Unexpected domain errors should generally only occur due to incorrect processing or mis-configuration of the application.

The sheer number of potential technical errors means that there will be a sizeable number that are unexpected. However, some technical errors will be identified as potentially recoverable as the system is developed and so specific error handling code may be introduced for them. If there is no recovery strategy for a particular error it may as well join the ranks of unexpected errors to avoid confusion in the support department ('why do they catch this and then re-throw it...').
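As an illustration of specific handling code for an expected technical error (the 'WAN link down - retry' case), the following is a minimal sketch. The names RemoteCall, Retry and withRetry are hypothetical, and treating IOException as the recoverable error is an assumption made for the example; any other exception is left to propagate as an unexpected error.

```java
import java.io.IOException;

// Hypothetical functional interface for a call that may hit a transient link failure.
interface RemoteCall<T> {
    T call() throws IOException;
}

final class Retry {
    private Retry() {}

    // Retries only the expected technical error (IOException, by assumption).
    // Anything else is not caught here and surfaces as an unexpected error.
    static <T> T withRetry(RemoteCall<T> action, int maxAttempts) throws IOException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (IOException ex) {
                last = ex; // expected and potentially recoverable: try again
            }
        }
        throw last; // recovery failed, so report the final failure
    }
}

class RetryDemo {
    public static void main(String[] args) throws IOException {
        int[] calls = {0};
        String reply = Retry.withRetry(() -> {
            if (++calls[0] < 3) {
                throw new IOException("WAN link down");
            }
            return "connected";
        }, 5);
        System.out.println(reply + " after " + calls[0] + " attempts");
    }
}
```

A real system would probably add a delay between attempts; that is omitted here to keep the sketch short.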

The table below illustrates the relationship between these two dimensions of error classification and the recommended strategy for handling each combination of the two dimensions, based on the strategies contained in this collection of patterns.

             Expected                               Unexpected

Domain       • Handle in the application code       • Throw an exception
             • Display details to the user          • Display details to the user
             • Don't log the error                  • Log the error

Technical    • Handle in the application code       • Throw an exception
             • Don't display details to the user    • Don't display details to the user
             • Don't log the error                  • Log the error
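The four combinations in the table can also be expressed in code. The following is a rough sketch only: the type names ErrorOrigin, Expectation and HandlingStrategy are hypothetical and do not come from the paper. It shows that each of the three decisions depends on just one of the two dimensions.

```java
// Hypothetical names throughout; this only encodes the table above.
enum ErrorOrigin { DOMAIN, TECHNICAL }

enum Expectation { EXPECTED, UNEXPECTED }

final class HandlingStrategy {
    final boolean throwException;    // false means: handle in the application code
    final boolean showDetailsToUser;
    final boolean logError;

    private HandlingStrategy(boolean throwException,
                             boolean showDetailsToUser,
                             boolean logError) {
        this.throwException = throwException;
        this.showDetailsToUser = showDetailsToUser;
        this.logError = logError;
    }

    static HandlingStrategy forError(ErrorOrigin origin, Expectation expectation) {
        boolean unexpected = expectation == Expectation.UNEXPECTED;
        boolean domain = origin == ErrorOrigin.DOMAIN;
        // Unexpected errors are thrown and logged;
        // only domain errors have their details shown to the user.
        return new HandlingStrategy(unexpected, domain, unexpected);
    }

    @Override
    public String toString() {
        return (throwException ? "throw" : "handle in code")
            + ", " + (showDetailsToUser ? "show details" : "hide details")
            + ", " + (logError ? "log" : "don't log");
    }
}

class ErrorStrategyDemo {
    public static void main(String[] args) {
        for (ErrorOrigin origin : ErrorOrigin.values()) {
            for (Expectation expectation : Expectation.values()) {
                System.out.println(origin + "/" + expectation + ": "
                    + HandlingStrategy.forError(origin, expectation));
            }
        }
    }
}
```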

Split Domain And Technical Errors

Problem

Applications have to deal with a variety of errors during execution. Some of these errors, which we term 'domain errors', are due to errors in the business logic or business processing (e.g. wrong type of customer for insurance policy). Other errors, which we term 'technical errors', are caused by problems in the underlying platform (e.g. could not connect to database) or by unexpected faults (e.g. divide by zero). These different types of error occur in many parts of the system for a variety of reasons. Most technical errors are, by their very nature, difficult to predict, yet if a technical error could possibly occur during a method call then the calling code must handle it in some way.

Handling technical errors in domain code makes this code more obscure and difficult to maintain.

Context

Domain and technical errors form different 'areas of concern'. Technical errors 'rise up' from the infrastructure - either the (virtual) platform, e.g. database connection failed, or your own artifacts, e.g. distribution façades/proxies. Business errors arise when an attempt is made to perform an incorrect business action. This pattern could apply to any form of application but is particularly relevant for complex distributed applications as there is much more infrastructure to go wrong!

Forces

Solution

Split domain and technical error handling. Create separate exception/error hierarchies and handle them at different points and in different ways as appropriate.

Implementation

Errors in the application should be categorized into domain errors (a.k.a. business, application or logical errors) and technical errors. When you create the exception/error hierarchy for your application, you should define your domain errors and a single error type to indicate a technical error, e.g. SystemException (see Figure 2). The definition and use of a single technical error type simplifies interfaces and spares calling code from needing to understand all of the things that can possibly go wrong in the underlying infrastructure. This is especially useful in environments that use checked exceptions (e.g. Java).

Figure 2

Design and development policies should be defined for domain and technical error handling. These policies should include:

As an example, consider the exception definitions in Listing 1.

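    public class DomainException extends Exception
    {
      ...
    }
    // Concrete domain errors derive from DomainException so
    // that 'catch (DomainException ex)' picks them up
    public class InsufficientFundsException
       extends DomainException
    {
      ...
    }
    // The single type used to report any technical error
    public class SystemException extends Exception
    {
      ...
    }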
  
Listing 1

A domain method skeleton could then look like Listing 2.

    public float withdrawFunds(float amount)  
       throws InsufficientFundsException,  
       SystemException  
    {  
      try  
      {  
        // Domain code that could generate various  
        // errors both technical and domain  
      }  
      catch (DomainException ex)  
      {  
        throw ex;  
      }  
      catch (Exception ex)  
      {  
        throw new SystemException(ex);  
      }  
    }  
Listing 2

This method declares two exceptions: a domain error (lack of funds to withdraw) and a generic system error, even though many different technical exceptions could occur (connectivity, database, etc.). The implementation passes domain exceptions straight through to the caller and wraps any other (non-domain) error in a generic SystemException. This means that the caller simply has to deal with the two checked exceptions rather than many possible technical errors.
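To show what this buys the caller, here is a minimal, self-contained sketch. The exception names follow the listings; the Account class, its storeAvailable flag (which simulates an infrastructure failure) and WithdrawalDemo are hypothetical additions for illustration, and the sketch assumes that InsufficientFundsException is a subclass of DomainException.

```java
class DomainException extends Exception {
    DomainException(String message) { super(message); }
}

class InsufficientFundsException extends DomainException {
    InsufficientFundsException(String message) { super(message); }
}

class SystemException extends Exception {
    SystemException(Throwable cause) { super("Technical failure", cause); }
}

class Account {
    private float balance;
    private final boolean storeAvailable; // hypothetical: simulates the database being reachable

    Account(float balance, boolean storeAvailable) {
        this.balance = balance;
        this.storeAvailable = storeAvailable;
    }

    float withdrawFunds(float amount)
            throws InsufficientFundsException, SystemException {
        try {
            if (!storeAvailable) {
                throw new IllegalStateException("could not connect to database");
            }
            if (amount > balance) {
                throw new InsufficientFundsException("Insufficient funds");
            }
            balance -= amount;
            return balance;
        } catch (DomainException ex) {
            throw ex;                      // domain errors pass straight through
        } catch (Exception ex) {
            throw new SystemException(ex); // everything else is wrapped
        }
    }
}

class WithdrawalDemo {
    // The caller distinguishes exactly two cases, whatever went wrong underneath.
    static String attempt(Account account, float amount) {
        try {
            return "OK, remaining balance " + account.withdrawFunds(amount);
        } catch (InsufficientFundsException ex) {
            return "Declined: " + ex.getMessage();
        } catch (SystemException ex) {
            return "Sorry, the service is unavailable";
        }
    }

    public static void main(String[] args) {
        System.out.println(attempt(new Account(100f, true), 30f));
        System.out.println(attempt(new Account(100f, true), 500f));
        System.out.println(attempt(new Account(100f, false), 30f));
    }
}
```

Note that the user sees the detail of the domain error but only a generic message for the technical one.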

Positive consequences

Negative consequences

Related patterns

Log At Distribution Boundary

Problem

The details of technical errors rarely make sense outside a particular, specialized, environment where specialists with appropriate knowledge can address them. Propagating technical errors between system tiers results in error details ending up in locations (such as end-user PCs) where they are difficult to access and in a context far removed from that of the original error.

Context

Multi-tier systems, particularly those that use a number of distinct technologies in different tiers.

Forces

Solution

When technical errors occur, log them on the system where they occur and pass a simpler, generic SystemError back to the caller for reporting at the end-user interface. The generic error lets the calling code know that there has been a problem so that it can handle it, but reduces the amount of system-specific information that needs to be passed back through the distribution boundary.

Implementation

Implement a common error-handling library that enforces the system error handling policy in each tier of the application. The implementation in each tier should log errors in a form that technology administrators are used to in that environment (e.g. the OS log versus a text file).

The implementation of the library should include both:

The library routine that logs technical errors (e.g. technicalError()) should:

Whenever a technical error occurs, the application infrastructure code that catches the error should call the technicalError routine to log the error on its behalf and then create a SystemError object containing a simple explanation of the failure and the unique error instance ID returned from technicalError. This error object should be then returned to the caller as shown in Listing 3.

    ...
    public class AccountRemoteFacade
       implements AccountRemote
    {
      public SystemError withdrawFunds(float amount)
         throws InsufficientFundsException,
         RemoteException
      {
        // Stays null unless a technical error occurs
        SystemError error = null;
        try
        {
          // Domain code that could generate various
          // errors both technical and domain
        }
        catch (DomainException ex)
        {
          throw ex;
        }
        catch (Exception ex)
        {
          // Log locally, then report a generic error
          // carrying the unique error instance ID
          String errorId = technicalError(ex);
          error = new SystemError(ex.getMessage(),
             errorId);
        }
        return error;
      }
    }
Listing 3

If a technical error can be handled within a tier, the SystemError need not be propagated back to the caller and execution can continue. This includes the case where the error is 'silently' ignored (see proto-pattern Ignore Irrelevant Errors), although even then it is always logged.
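The paper names the technicalError routine and the SystemError type but does not show their implementation, so the following is only a sketch of one possible shape. The ErrorHandling class, the toSystemError helper, the logger name and the use of java.util.logging with a random UUID are all assumptions made for the example.

```java
import java.util.UUID;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical shape for the generic error that crosses the distribution boundary.
final class SystemError {
    private final String message;
    private final String errorId;

    SystemError(String message, String errorId) {
        this.message = message;
        this.errorId = errorId;
    }

    String getMessage() { return message; }
    String getErrorId() { return errorId; }
}

final class ErrorHandling {
    // Assumed logger name; each tier would log in whatever form its administrators expect.
    private static final Logger LOG = Logger.getLogger("tier.errors");

    private ErrorHandling() {}

    // Logs the full technical detail locally and returns the ID it was logged under.
    static String technicalError(Throwable cause) {
        String errorId = UUID.randomUUID().toString();
        LOG.log(Level.SEVERE, "Technical error [" + errorId + "]", cause);
        return errorId;
    }

    // Builds the simplified error for the caller: a plain explanation plus the ID.
    static SystemError toSystemError(Throwable cause) {
        String errorId = technicalError(cause);
        return new SystemError("The request could not be completed", errorId);
    }
}

class BoundaryDemo {
    public static void main(String[] args) {
        SystemError error = ErrorHandling.toSystemError(
            new java.io.IOException("connection refused by db01"));
        // Only the generic text and the ID leave this tier; the detail stays in the local log.
        System.out.println(error.getMessage() + " (reference " + error.getErrorId() + ")");
    }
}
```

Unlike Listing 3, this sketch returns a fixed explanation rather than ex.getMessage(), on the assumption that the underlying exception message may itself contain technical detail.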

Positive consequences

Negative consequences

Related patterns

Unique Error Identifier

Problem

If an error on one tier in a distributed system causes knock-on errors on other tiers you get a distorted view of the number of errors in the system and their origin.

Context

Multi-tier systems, particularly those that use load balancing at different tiers to improve availability and scalability. Within such an environment you have already decided that as part of your error handling strategy you want to Log at Distribution Boundary.

Forces

Solution

Generate a Unique Error Identifier when the original error occurs and propagate this back to the caller. Always include the Unique Error Identifier with any error log information so that multiple log entries from the same cause can be associated and the underlying error can be correctly identified.

Known uses

The authors have observed this pattern in use within a number of successful enterprise systems. We do not know of any publicly accessible implementations of it (because most systems available for public inspection are single tier systems and so this pattern is not relevant to them).

Implementation

The two key tenets that underlie this pattern are the uniqueness of the error identifier and the consistency with which it is used in the logs. If either of these is implemented incorrectly then the desired consequences will not result.

The unique error identifier must be unique across all the hosts in the system. This rules out many pseudo-unique identifiers such as those guaranteed to be unique only within a particular virtual platform instance (.NET Application Domain or Java Virtual Machine). The obvious solution is to use a platform-generated Universally Unique ID (UUID) or Globally Unique ID (GUID). Time-based (version 1) identifiers include the network card's hardware address, which provides uniqueness in space (across servers). The only remaining issue is then uniqueness in time (two errors occurring very close together), but the narrowness of the window (100ns) and the randomly initialized clock sequence that forms part of the identifier should prevent such problems arising in most scenarios. Many platforms now generate purely random (version 4) identifiers by default, for example Java's UUID.randomUUID(); these contain 122 random bits, which makes a collision between hosts extremely unlikely rather than strictly impossible.

It is important to maintain the integrity of the identifier as it is passed between hosts. Problems may arise when passing a 128-bit value between systems and ensuring that the byte order is correctly interpreted. If you suspect that any such problems may arise then you should pass the identifier as a string to guarantee consistent representation.
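In Java, for example, the string form avoids any byte-order question because the identifier travels as text. A minimal sketch using the standard java.util.UUID class follows; the ErrorIdDemo class and its method names are hypothetical.

```java
import java.util.UUID;

class ErrorIdDemo {
    // Sending side: generate the identifier and convert it to its canonical string form.
    // UUID.randomUUID() produces a random (version 4) identifier.
    static String newErrorId() {
        return UUID.randomUUID().toString();
    }

    // Receiving side: parse the string back into a UUID.
    // UUID.fromString throws IllegalArgumentException if the text is malformed.
    static UUID parseErrorId(String wireValue) {
        return UUID.fromString(wireValue);
    }

    public static void main(String[] args) {
        String wireValue = newErrorId();
        UUID received = parseErrorId(wireValue);
        System.out.println(wireValue + " round-trips intact: "
            + received.toString().equals(wireValue));
    }
}
```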

The mechanism for passing the error identifier will depend on the transport between the systems. In an RPC system, you may pass it as a return value or an [out] parameter whereas in SOAP calls you could pass it back in the SOAP fault part of the response message.

In terms of ensuring that the unique identifier is included whenever an error is logged, the responsibility lies with the developers of the software used. If you do not control all of the software in your system you may need to provide appropriate error handling through a Decorator [Gamma95] or as part of a Broker [Buschmann96]. If you control the error framework you may be able to propagate the error identifier internally in a Context Object [Fowler].
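The Decorator option might look something like the sketch below. All of the names (QuoteService, IdentifiedError, ErrorIdDecorator) are hypothetical, and the log is modelled as a simple Consumer of strings to keep the example self-contained. The decorator assigns an identifier when it sees an error that does not yet have one and reuses the existing identifier otherwise, so that entries logged at different layers can be associated.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;
import java.util.function.Consumer;

// Hypothetical interface of a component whose code we do not control.
interface QuoteService {
    String quote(String symbol);
}

// Hypothetical exception that carries the unique error identifier with it.
class IdentifiedError extends RuntimeException {
    private final String errorId;

    IdentifiedError(String errorId, Throwable cause) {
        super("Error " + errorId, cause);
        this.errorId = errorId;
    }

    String getErrorId() { return errorId; }
}

class ErrorIdDecorator implements QuoteService {
    private final QuoteService delegate;
    private final Consumer<String> log; // stands in for the tier's real error log

    ErrorIdDecorator(QuoteService delegate, Consumer<String> log) {
        this.delegate = delegate;
        this.log = log;
    }

    @Override
    public String quote(String symbol) {
        try {
            return delegate.quote(symbol);
        } catch (IdentifiedError ex) {
            // Already identified lower down: reuse the same ID in this log entry.
            log.accept("[" + ex.getErrorId() + "] propagated: " + ex.getCause());
            throw ex;
        } catch (RuntimeException ex) {
            // First sighting of the error: assign the ID here.
            String errorId = UUID.randomUUID().toString();
            log.accept("[" + errorId + "] " + ex);
            throw new IdentifiedError(errorId, ex);
        }
    }
}

class DecoratorDemo {
    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        QuoteService failing = symbol -> { throw new IllegalStateException("feed down"); };
        QuoteService service =
            new ErrorIdDecorator(new ErrorIdDecorator(failing, log::add), log::add);
        try {
            service.quote("ACME");
        } catch (IdentifiedError ex) {
            System.out.println("Reported to caller with ID " + ex.getErrorId());
        }
        log.forEach(System.out::println); // both entries carry the same ID
    }
}
```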

Positive consequences

Negative consequences

Related patterns

To be continued...

So far, so good. However this is only part of the story as there are still some fundamental principles to be applied such as determining what is and is not an error. The remaining patterns in this pattern collection (Big Outer Try Block, Hide Technical Error Detail from Users, Log Unexpected Errors and Make Exceptions Exceptional) will show how the error handling jigsaw can be completed. These patterns will be explored in the next issue.

References

[Buschmann96] Pattern-Oriented Software Architecture, John Wiley and Sons, 1996

[Dyson04] Architecting Enterprise Solutions: Patterns for High-Capability Internet-based Systems, Paul Dyson and Andy Longshaw, John Wiley and Sons, 2004

[Gamma95] Design Patterns, Addison Wesley, 1995.

[Haase] Java Idioms - Exception Handling, linked from http://hillside.net/patterns/EuroPLoP2002/papers.html.

[Harrison] Patterns for Logging Diagnostic Messages, Neil B. Harrison

[Renzel97] Error Handling for Business Information Systems, Eoin Woods, linked from http://hillside.net/patterns/onlinepatterncatalog.htm
