Title: Software Engineers Toolbox

Author:

Date: 03 April 1996 13:15:27 +01:00 or Wed, 03 April 1996 13:15:27 +01:00

Summary:

Body:

The Preprocessor - Part 1

When we think about compiling a program, we tend to think of the whole process (starting with a file of C source code and finishing with an executable program image) as a single, monolithic process. This view is reinforced by the fact that often we need only issue a single command (such as cl, cc or bcc) to go from the source file to an executable program. (I gave a detailed overview of the whole process in an earlier, six-part series, starting in CVu, Vol 5 Issue 5.) In fact, the process of compilation comprises many distinct activities performed in series. The C standard defines eight phases of compilation. This is a bit much for our purpose here, so I will simply divide it into three major blocks: preprocessing, translation and linking. Translation is the principal activity, which converts a text file of valid C tokens into an object file (a partial binary program image).

The linker joins these object files into a single executable unit. The file which the translator takes as its input is called a translation unit. This is not (in general) the same as the file of C code - the source file - that you created with your editor. Although the source file and translation unit are both text files, they can be quite different. The purpose of the preprocessor is, as we shall see, to convert the source file that you wrote into a translation unit that the translator can handle.

Provided an ISO C compiler obeys the rules of translation, it does not have to provide a separate preprocessor. The functionality of the preprocessor and the translator (and the linker, for that matter) could be built into a single, monolithic program. In practice, this is both unnecessary and undesirable.

Usually, the compiler will be implemented as a set of three, four or even more programs that run sequentially. Each program creates a temporary output file which is used as the input to the next program in the chain. Normally, these temporary files are transparent to the user but most compilers have options which allow you to save some of them (or at least a representation of them) to a text file.

This can be a very useful way to learn more about the way the preprocessor works. Most compilers let you create a preprocessor output file, which is more-or-less the translation unit that the translator sees. By comparing this with the original source file, you can see just what transformations the preprocessor makes.

Why a Preprocessor?

Now that I've shown where the preprocessor fits into the scheme of things and broadly hinted at what it does, the obvious question must be, why do we need one in the first place? The glib answer would be 'To make up for some shortcomings in the language.'

That would be at least half true. Some of the features that the preprocessor is used for can, and probably should, be done in the translator. (Some may even make it into the next revision of the standard.) However, given that we have to live with a certain core language definition, let's see what a preprocessor can do for us.

Probably the single most common use of the preprocessor is file inclusion. There are often lines of code, such as function prototypes, type definitions and global variable declarations, which we wish to include in more than one file of a project. It is impractical (and dangerous) to simply add the same lines to every file. Maintenance would be (is!) a nightmare. The answer is to put all the common code in a file of its own, then simply include the text of that file wherever it is needed in other files.

Another very common use is the definition of manifest constants. A manifest constant is one where a constant value is associated with a name. The code then uses the name instead of the value. This is important for several reasons. First, numbers by themselves give no clue to their purpose. By using a name instead, you can provide meaning. Seeing an array definition with a size of 10 tells you little. Seeing a size of MAX_BUTTONS tells you much more. Second, you may need to use a certain value in many places. If you use the 'raw' number, changing it becomes very difficult. You can't just change all occurrences of 10 to 12, say, because 10 may have been used for many different reasons. You would have to laboriously check each one to see if you should change it. If you use a manifest constant, you only change the value associated with the name and the new value is automatically used in all the places where the name occurs.

const-qualified variables provide a solution in some contexts, but there are problems. A const variable normally has storage allocated for it. This is usually unnecessary and may waste space if there are many constants. Also, a const variable cannot always be used in place of a true constant (for example, as a case label or as the size of a file-scope array). The preprocessor can be used to get around this.
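As a minimal sketch of the difference, the fragment below sizes a file-scope array with a manifest constant. MAX_BUTTONS comes from the text above; button_state, max_buttons_var and button_count are names invented here purely for illustration.

```c
#define MAX_BUTTONS 10   /* manifest constant: a name for the value 10 */

/* A const-qualified variable is an object, not a constant expression in C. */
const int max_buttons_var = 10;   /* hypothetical name */

/* Legal: after preprocessing this reads  static int button_state[10];  */
static int button_state[MAX_BUTTONS];

/* Not legal C at file scope, so it is left commented out:
   static int other_state[max_buttons_var];
*/

/* Number of elements in the array, worked out by the compiler. */
int button_count(void)
{
    return (int)(sizeof button_state / sizeof button_state[0]);
}
```

Changing the 10 in the #define to 12 resizes the array, and anything else written in terms of MAX_BUTTONS, at the next compile.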

An important requirement in many applications is speed, and code can often be speeded up considerably by avoiding function call overheads. The solution to this is to create some sort of function which expands to inline code. Some languages allow this by applying a keyword (such as inline) to a normal function. In C, we can only do this by using the preprocessor.
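To give a flavour of the idea, here is a small sketch comparing an ordinary function with a macro that expands to inline code. The names square_fn, SQUARE and sum_of_squares are invented for this illustration.

```c
/* Ordinary function: every use costs a call and a return. */
int square_fn(int x)
{
    return x * x;
}

/* Macro version: the preprocessor substitutes the expression in place,
   so no function call is made. */
#define SQUARE(x) ((x) * (x))

int sum_of_squares(int a, int b)
{
    /* After preprocessing: return ((a) * (a)) + ((b) * (b)); */
    return SQUARE(a) + SQUARE(b);
}
```

The price is that the argument text is copied wherever it appears in the macro body, so an argument with side effects (such as i++) would be evaluated more than once.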

Sometimes, it is necessary to change the code depending on certain conditions. Examples are compiling for different environments or building debug or production versions of code. It is useful to be able to include code for all possibilities and then select which code will be used at compile time. This is called conditional compilation.
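A minimal sketch of conditional compilation might look like the following. The macro name DEBUG and the function build_kind are assumptions made for the example; any name could be used.

```c
/* Define DEBUG for a debug build; remove this line for production. */
#define DEBUG

const char *build_kind(void)
{
#ifdef DEBUG
    return "debug";        /* only this line reaches the translator */
#else
    return "production";   /* discarded while DEBUG is defined */
#endif
}
```

Both versions live in the same source file, but only one of them ends up in the translation unit.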

These are the major requirements which cannot be adequately met with just the core language definition. In C, they are met using preprocessor facilities. There are a few other, minor tasks which the preprocessor can do, which I will discuss later.

So how is it done?

As a programmer, you control the actions of the preprocessor by including preprocessor directives in the source file. These are commands which the preprocessor (and only the preprocessor) knows how to execute. There are twelve named preprocessor directives, which are listed in the following table. (The standard also allows a null directive, a '#' on a line by itself, which does nothing.)

Table 1.

Directive	Action
#include	Include a file of text
#define	Define a macro
#undef	Undefine a macro
#if	Conditional test
#elif	Conditional else-if test
#else	Conditional else
#ifdef	Conditional test - macro is defined
#ifndef	Conditional test - macro is not defined
#endif	End of conditional test
#error	Abort compilation
#line	Set line number and file name
#pragma	Implementation specific command

Each of these directives must be entered so that the '#' is the first non-blank character on a line. Earlier compilers often required the '#' to be in the first column. ISO C allows any amount of white space (space and horizontal tab only) before the '#' or between the '#' and the directive name. This allows preprocessor directives to be indented in the same way as other code to improve readability. This is particularly useful when using conditional directives.
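The following sketch shows both styles of indentation in nested conditional directives. TARGET_UNIX, USE_LONG_PATHS and PATH_LIMIT are hypothetical configuration macros, and the values are arbitrary.

```c
#define TARGET_UNIX    1   /* hypothetical configuration switches */
#define USE_LONG_PATHS 1

#if TARGET_UNIX
#   if USE_LONG_PATHS
#       define PATH_LIMIT 1024   /* white space between '#' and the name */
#   else
#       define PATH_LIMIT 256
#   endif
#else
    #define PATH_LIMIT 64        /* white space before the '#' */
#endif

int path_limit(void)
{
    return PATH_LIMIT;
}
```

Keeping the '#' in the first column and indenting only the directive name is a common compromise for code that may still meet an older compiler.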

There are also three preprocessing operators, defined, # and ##.

Table 2.

Operator	Action
defined	Evaluates true if macro is defined
#	Stringising
##	Token pasting
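As a brief preview, the three operators might be used as sketched here. All the names (VERSION, NO_SUCH_MACRO, HAVE_VERSION, STR, JOIN, port1 and the functions) are invented for the illustration.

```c
#define VERSION 2   /* hypothetical macro to test for */

/* 'defined' is used in #if and #elif expressions. */
#if defined(VERSION) && !defined(NO_SUCH_MACRO)
#define HAVE_VERSION 1
#else
#define HAVE_VERSION 0
#endif

/* '#' turns a macro argument into a string literal. */
#define STR(x) #x

/* '##' joins two tokens into a single token. */
#define JOIN(a, b) a##b

int port1 = 11;

const char *name_of_port1(void) { return STR(port1); }     /* "port1" */
int value_of_port1(void)        { return JOIN(port, 1); }  /* port1   */
int have_version(void)          { return HAVE_VERSION; }
```

Note that # and ## only have these meanings inside a macro replacement list, while defined is only meaningful in a conditional directive.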

Now let's look at how these are used.

Defining Macros

The term 'macro' is often used to mean a sequence of instructions to be executed, as in shell scripts or DOS batch files. C macros are rather different. A C macro is a form of programmable text substitution. There are two forms of macro, object-like macros and function-like macros. Where it is necessary to make a clear distinction between them, I will refer to them as o-macros and f-macros. The simpler form of macro is the o-macro. This has the syntax:

#define  identifier replacement-list new-line

The identifier defines the name of the macro. It must conform to the usual rules for naming C identifiers, so a macro name can look just like a variable or function name. The replacement list is all the text following the identifier (including spaces), except that the white space between the identifier and the replacement list is discarded, as is any trailing white space at the end of the line. Once the preprocessor has processed this line, it will look for the macro name anywhere in the C code except inside string constants. If it finds one, it will replace the name by all the text in the replacement list.

For example, consider the macro directive:

#define  MAX_PORTS  4

Once the preprocessor has seen this macro definition, it will change any (non-string) occurrence of MAX_PORTS to 4. So if we had the following program fragment in our source file,

void func(void) {
  int i;
  for(i=0; i<=MAX_PORTS; i++)
    init_port(i);
  puts("MAX_PORTS\n");
}

it would be changed by the preprocessor to

void func(void) {
  int i;
  for(i=0; i<=4; i++)      /* that MAX_PORTS was changed */
    init_port(i);
  puts("MAX_PORTS\n");     /* that one wasn't; it's in a string */
}

If you have been reading attentively, you should realise by now that this is using an o-macro to provide a manifest constant. (There are other applications, as we shall see in a minute.) Although this example uses an integer, this is not the only way it can be used. You can just as easily create manifest constants for other types of constant. For example,

#define  DEFAULT_DIR "/usr/bin"

and

#define  EPSILON  1.2E-10

create manifest constants for a string and a real number respectively.
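A short sketch of how these two constants might then appear in code follows. The functions default_dir and nearly_equal are invented for the example.

```c
#define DEFAULT_DIR "/usr/bin"
#define EPSILON     1.2E-10

/* After preprocessing: return "/usr/bin"; */
const char *default_dir(void)
{
    return DEFAULT_DIR;
}

/* Treat two doubles as equal if they differ by less than EPSILON. */
int nearly_equal(double a, double b)
{
    double diff = a - b;
    if (diff < 0.0)
        diff = -diff;
    return diff < EPSILON;   /* after preprocessing: diff < 1.2E-10 */
}
```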

O-macros can be used for purposes other than just manifest constants. They can be used for any straight text substitution. A typical example which you may sometimes see is something like:

#define  LOOP     while(1)

This is then used in the code as, say:

LOOP
{
 /*  continuous process */
}

Another example that I have seen is to create aliases for complex variable expressions. I worked on a package that used complex communications message formats. The structure into which the data was stored had several levels of nested structs and unions. If the element names were used in full, you ended up with long lines of code full of names like Q931.Infomation.Codeset6.UserID.String. To make the code more readable (and more manageable), these were aliased using o-macros such as:

#define  Msg_Username Q931.Infomation.Codeset6.UserID.String

Now the code could use the much shorter (and more meaningful) alias name to refer to the data object.
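The sketch below shows the alias at work. The structure layout is hypothetical and much simplified, since the real message format is not given here; only the member names and the macro are taken from the text (with the original spelling), and set_username is an invented helper.

```c
#include <string.h>

/* Hypothetical, much simplified layout of the message structure. */
struct {
    struct {
        struct {
            struct {
                char String[32];
            } UserID;
        } Codeset6;
    } Infomation;
} Q931;

#define Msg_Username Q931.Infomation.Codeset6.UserID.String

/* Copy a name into the message, always leaving it terminated. */
void set_username(const char *name)
{
    /* Each Msg_Username expands to the full member expression. */
    strncpy(Msg_Username, name, sizeof Msg_Username - 1);
    Msg_Username[sizeof Msg_Username - 1] = '\0';
}
```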

So far, I have only given examples of replacement lists which are a single token. They can, of course, be much more complex than this. They can even include references to other macros. Say, for example, that we wish to create a manifest constant that yields a number which is two bigger than a default buffer size, BUF_SIZE. Somewhere in the program is a definition

#define  BUF_SIZE  100

At a point after this definition, we can write

#define  NEW_BUF  BUF_SIZE + 2

When the preprocessor encounters NEW_BUF, it will simply replace it with the replacement list text 'BUF_SIZE + 2'. After the substitution, the preprocessor rescans the line looking for more macros. It now sees the macro BUF_SIZE, and replaces it with its replacement list, 100. The next rescan sees no more macro names, so makes no more changes. If we had started with the line

int  buf[NEW_BUF];

then the final result in the translation unit would be

int buf[100 + 2];

Now this example would work as expected, but there is a potential problem. What if you wrote a declaration

int  buf[NEW_BUF * 2];

Since you expect NEW_BUF to be 102, the expected answer is that the array will have 204 elements, but it doesn't. Remember, these are not true manifest constants, simply textual replacements. What you end up with is

int buf[100 + 2 * 2];

which evaluates to 104, not 204. This is a very common precedence problem with this type of complex expression in replacement lists. To avoid such problems, you must adopt the practice of ALWAYS putting the expression in parentheses. This then ensures that the expressions are evaluated as you would expect. The NEW_BUF definition should be written as:

#define  NEW_BUF  (BUF_SIZE + 2)
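The two forms can be compared side by side in a short sketch. NEW_BUF_BAD is a name invented here so that both definitions can coexist, and the arrays and counting functions are likewise only for illustration.

```c
#define BUF_SIZE     100
#define NEW_BUF_BAD  BUF_SIZE + 2     /* unparenthesised form */
#define NEW_BUF      (BUF_SIZE + 2)   /* parenthesised form   */

int bad[NEW_BUF_BAD * 2];   /* becomes  int bad[100 + 2 * 2];    */
int good[NEW_BUF * 2];      /* becomes  int good[(100 + 2) * 2]; */

int bad_count(void)  { return (int)(sizeof bad  / sizeof bad[0]);  }  /* 104 */
int good_count(void) { return (int)(sizeof good / sizeof good[0]); }  /* 204 */
```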

Although the examples I have given so far contain complete expressions, this is not necessary. A macro replacement list can contain just about any text, including no text. A macro with no replacement list is called an empty macro and has several uses. We will see uses for empty macros when we look at conditional compilation. The only restriction on the replacement list text is that it must consist of valid preprocessor tokens (pp-tokens). The set of valid pp-tokens includes identifiers, character constants, string literals, pp-numbers (integers and floating numbers), operators and punctuators. What you cannot have is a partial token. For example,

#define  STRING  "text string"

is acceptable,

#define STRING1  "text
#define STRING2   string"

are not, even if they would appear to create a valid token after processing. For example,

puts( STRING1 STRING2);

would appear to give a valid line of code after substitution, but it is not allowed. Each replacement list must contain complete tokens. They do not, however, have to be complete expressions.
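By way of contrast, here is a sketch of what is allowed. Each replacement list below consists of complete tokens, although PLUS_TWO is not a complete expression. STRING1 and STRING2 are redefined here as two complete string literals, and PLUS_TWO, joined and add_two are invented names.

```c
/* Two complete string literals; the translator joins adjacent literals. */
#define STRING1 "text "
#define STRING2 "string"

const char *joined(void)
{
    return STRING1 STRING2;   /* becomes  return "text " "string"; */
}

/* Complete tokens, but not a complete expression on their own. */
#define PLUS_TWO + 2

int add_two(int n)
{
    return n PLUS_TWO;        /* becomes  return n + 2; */
}
```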
