Journal Articles

CVu Journal Vol 10, #6 - Sep 1998

Browse in :

All > Journals > CVu > 106 (12)

Note: when you create a new publication type, the articles module will automatically use the templates user-display-[publicationtype].xt and user-summary-[publicationtype].xt. If those templates do not exist when you try to preview or display a new article, you'll get this warning :-) Please place your own templates in themes/yourtheme/modules/articles . The templates will get the extension .xt there.

Title: A Critique of Some Code

Author: Administrator

Date: 03 September 1998 13:15:27 +01:00 or Thu, 03 September 1998 13:15:27 +01:00

Summary:

Body:

The sidebar contains a letter from Spencer Davies <spencer@mettav.demon.co.uk> that was attached to a program that he submitted for comment. He also asks a number of other questions. I know that Francis would want any responses copied to him for publication in future issues of C Vu. I am not going to look at all of Spencer's code in this issue as even short programs take up a lot of space and raise many issues. I know that Spencer has specifically agreed for his name to be published. That way he gets the chance of quick responses from other members.

Francis,

I think that it's about time that I asked both your own and the groups advice on programming style, approach and idiom with regards to C.

I have been learning the C programming language on and off for some time. This has been mainly in my own time, with the view to using the knowledge gained for my own enjoyment and within my working environment.

On a couple of occasions I have done a bit of programming in the work environment (please note we are not actually a programming group but are rather involved in systems integration) only to be told by the group leader that my code is a mess which makes his head swim. He ends up re-writing it. I accept his appraisal of my work, however he seems unable to give me an explanation as to what the precise problems are with my coding approach. (At least an explanation I can understand which doesn't send me off into a glazed eye state!) That is very disheartening and annoying. I believe part of my problem is that I have never actually learnt how to program (I started off with a BASIC manual which came with my first computer in 1982 and have been going down hill since! )

Apart from a couple of courses, which have tended to teach just the structures of the languages concerned, everything I have learnt has been picked up from manuals and the like. Can anyone recommend an approach which will possibly "unteach" any bad habits I have picked up and in their place "retrofit" a more reasonable approach to the task of program design and implementation (or something along those lines).

I have enclosed my latest private programming attempt, which apart from the 32bit checksum code is all my own work. This is to provide an overview of my programming style which people could possibly critique and provide recommendations as to where I am going wrong and how to improve matters. Program design was no more than a rough outline of what I wanted to do and how I foresaw doing it. The program was initially written under Lattice C 5.6 on an Atari TT then ported across to Watcom C 10.6 on the PC so the formatting of comments, etc. might seem a bit weird.

Lastly, everyone is constantly going on about both "object orientated programming" and C++, to which procedural programming and C seem to be taking a back seat. With no knowledge of such approaches and again a glassy eyed look when ever anyone mentions the likes of "encapsulation" or "operator overloading", would it be worthwhile starting again and learning this approach/language or should I stay with C until my more fundamental problems are resolved. If it is worth tackling C++ and object orientation where do I start?

Thanks for any help, advice or just plain moans.

Spencer Davies <spencer@mettav.demon.co.uk>

Relearning

One of the problems with ridding ourselves of old habits that we have acquired through the seat of our pants is that they will be based on what we find natural. In other words our instincts are at fault. This makes it very hard to change.

Negative criticism of the form Spenser has effectively received does little to help. I have a lot of sympathy for him in that he has learnt the way many of the old timers had to.

Unfortunately, much of the code that you see published is too poor to use as a model for your own.

Some Comments

Worse, comments (an essential ingredient in making code easily understood) in most textbooks are completely inappropriate. They are focused on helping a novice understand the syntax and semantics of the language. Comments are supposed to help you understand how the code solves the problem and to highlight dangerous or arcane lines of source code.

Much production code is inappropriately commented because the comments belong to an earlier version, are about the language or have been added to keep some manager happy.

A few well-chosen comments are worth the time that it took to write them. Unnecessary comments simply serve to confuse.

Global Identifiers

Before I go on to examine some of Spencer's source code in detail I think it is about time that I tried to explain a couple of problems with using global identifiers.

Some things, such as functions in C, have to use names (identifiers) that are either at file scope (visible throughout the file) or at global scope (visible to the linker). Every time you declare some identifier at global scope you introduce one more place where the linker can become confused. In C the compiler creates a single list of global names for everything other than tags (used by struct, union and enum). It has no mechanism for telling the linker how the name is being used. We can help by qualifying all names used at file scope (that is outside prototypes, definitions of struct/union types and the bodies of functions) as static unless we specifically want to use the objects so named in another file.

Your first reaction is probably to assume that such qualification ensures that reused names will not cause problems. That is not quite true because such a reused name may conceal a global name that should have been used. For example:

static void fn(void){
/* do something */
}
/* lots of code */

Now when some maintenance programmer uses fn() somewhere in that 'lots of code' the compiler will use the local definition even if the maintenance programmer intended the use of some other fn() defined in another file. As C allows implicit declar-ation of functions (and too many programmers are too lazy to over-rule that by providing explicit prototypes) the compiler will happily compile the code and the linker will never be able to issue a redefinition error.

The point I am trying to make is that global names are a precious resource and should not be frittered away without good reason. Generally variables (as opposed to function names) should not be visible globally, and even at file scope there needs to be a good reason for using them.

Another problem with global names is that they produce serious debugging problems because they are accessible throughout the program. This means that they can be changed almost anywhere. (Be very wary of using the address of a file scope variable because that too means that the value contained may be changed somewhere else).

The final problem with identifiers is caused by the pre-processor (both a great blessing and a horrible curse). C does not provide any mechanism for distinguishing pre-processor identifiers from compile time identifiers. They use exactly the same character set and generally follow the same constraints.

This leads to a coding guideline that I believe should be immutable:

Pre-processor identifiers should NEVER include a lower case letter. All other identifiers should include at least one lower case letter.

I am absolutely certain that the multitude of authors that use entirely upper case identifiers for manifest constants provided via an enum or via a const variable do not understand why the traditional #define route used upper case. If I come across FALSE' in someone's code I should be certain that somewhere there is a pre-processor directive:

#define FALSE 0

If I come across false' I should expect to find either:

int const false = 0;

enum {false};

This separation between pre-processor identifiers and all others is absolutely essential. It should supersede all other guidelines about identifiers. Unfortunately, the C Standard Library breaks this guideline. We have to live with that, but we do not have to like it nor do we have to emulate it.

Time to look at some code.

#include <stdio.h>
#include <stdlib.h>
unsigned long crc32_table[256] = { /* lookup table */
0x00000000, 0x77073096, 0xEE0E612C, 0x990951BA, 0x076DC419, 
/* 251 further values, omitted by the Harpist to save space */
};
unsigned long crc32(char filein[]) {
  FILE *in;             /* input file */
  unsigned long crc;    /* CRC value */
  unsigned char *buf;   /* pointer to the input buffer */
  size_t i, j;        /* buffer positions*/
  int k;              /* generic integer */
  /* open file */
  if((in = fopen(filein, "rb")) == NULL){
      fprintf(stdout, "CRC32: Can't open file %s\n", filein);
      return -1;
  }
  /* allocate buffer */
  if((buf = (unsigned char *) malloc(32766)) == NULL){
    fprintf(stdout, "CRC32: Can't allocate memory!\n");
    fclose(in);
    return -2;
  }
  crc = 0xFFFFFFFF; /* preconditioning sets non zero value */
  /* loop through the file and calculate CRC */
  while(1){
      i=fread(buf, 1, 32766, in);
      if(i==0) break;
      for(j=0; j < i; j++){
        k=(crc ^ buf[j]) & 0x000000FFL;
        crc=((crc >> 8) & 0x00FFFFFFL) ^ crc32_table[k];
      }
  }
  crc=~crc; /* postconditioning */
  free(buf);
  fclose(in);
  return crc;
}

I do not know whether Spencer wrote this code or if he simply copied it from a book of algorithms. If it was the latter it only goes to show how bad much published code is.

What do you think of the comments? What about the variable names? Rather than telling me that crc_table is a look-up table (surely obvious) would it not have been more helpful to at least either explain the significance of the entries or to give a reference to somewhere that describes the algorithm?

If it is a look-up table (it is) why has it not been qualified as const? If it is only used in this file (it is) why has it not been qualified as static? If it is only used by crc32, why hasn't it been declared as a static local variable in crc32()? Finally, why has it been laboriously initialised in the file? Suppose that some other table of values was wanted, this mechanism requires recompilation. Try the following:

unsigned long compute_crc32(char const * infile_name) {
  static unsigned long * crc32_table =  init_crc32_table();
/* rest of function */

Note that even though init_crc32_table() will only be called once during execution of the program I still make it a free-standing function. I do not want to clutter up compute_crc32() with details of initialising the look-up table. I cannot emphasise too strongly how important such extraction of source code is. Also, note the change of name of the function. Another small issue but helpful. Function names should look like verbs because that is the way they are used. crc32 should be a value because that is the way we would read it.

I have changed the parameter both because I want to const qualify it (string literals will be arrays of const char in future) and because it is a C idiom. Don't just do things differently because you think it might be better; that makes it harder for others to read your code. Finally I changed the parameter name to make it more descriptive.

Next look at how the file is being processed. It looks as if the programmer does not know that files are already buffered in C. Indeed the buffering should have been chosen by the implementers to provide efficient access. I am not saying that the programmer's mechanism is positively wrong but at the very least there should be a comment explaining it. Adding your own buffering adds places where things can go wrong. Until I had a very good reason to do otherwise I would write something like:

  FILE *input_file;
  unsigned long crc32 = 0xFFFFFFFF; /* preconditioning sets non zero value / 
  /* open file */
  input_file = fopen(infile_name, "rb")
  if(!input_file){
      fprintf(stdout, "CRC32: Can't open file %s\n", infile_name);
      return file_failure;
  }
    /* loop through the file and calculate CRC */
  {
    int next_byte = getc(input_file);
    while(next_byte != EOF){
      next_byte =(crc 32^ next_byte) & 0x000000FFL;
      crc32=((crc 32>> 8) & 0x00FFFFFFL) ^ crc32_table[next_byte];
      next_byte = getc(input_file);
    }
  }
  crc32=~crc32; /* postconditioning */
  fclose(input_file);
  return crc32;
}

Now please note that there were some errors in the original code. The two error returns propose to return a negative value through an unsigned long. A little thought reveals that unless you can guarantee that one or more of the values in the unsigned long range can never be created by the algorithm on which this function is based then there are no available values for error return. We cannot even fix the problem the way that getc() does (by returning an int, when all the possible non-error values only use an unsigned char) because we have nothing larger than an unsigned long (many systems provide long long but that is not standard at least until C9X is released) to provide surplus values.

We have a number of possible solutions:

We can have an extra parameter that is used to return an error status.
We can return the CRC in a parameter and use the return value to signal success/failure
We can raise a signal
We can use a global variable to signal an error (like the use of errno by the standard library)

The last of these is dangerous for two reasons:

too many programmers do not bother to check
it does not work in multi-threaded environments.

The penultimate relies on using signal handlers and that would seem to be something of overkill for this kind of problem.

Of the first two methods, I have a strong preference for the second one because it is easy to use and therefore stands a good chance of being used.

Before I provide my final version there are two other points that need attention. I do not understand why the function returns the complement. That is the kind of line that deserves a comment. As I do not know the answer, I cannot supply that comment. The second issue is that buried in the implementation is the implicit requirement that the look-up table contains 256 elements. We would be advised to make this more explicit in our code.

int compute_crc32(char const * infile_name, unsigned long * crc32) {
  static unsigned long  crc32_table[256]; /* this algorithm requires a look-up table of 256 elements */
  static int initialised = false;
  FILE *input_file = fopen(infile_name, "rb");
  if(!initialised ) {
     init_crc32_table(crc32_table); initialised = true;
  }
  /* check file opened OK */
  if(!input_file){
      fprintf(stdout, "CRC32: Can't open file %s\n", infile_name);
      return false; /* failed */
  }
  *crc32 = 0xFFFFFFFF; /* preconditioning sets non zero value */ 
    /* loop through the file and calculate CRC */
  {
    int next_byte = getc(input_file);
    while(next_byte != EOF){
      next_byte =(*crc 32^ next_byte) & 0x000000FFL;
      *crc32=((*crc 32>> 8) & 0x00FFFFFFL) ^ crc32_table[next_byte];
      next_byte = getc(input_file);
    }
  }
  *crc32=~*crc32; /* postconditioning */
  fclose(input_file);
  return true; /* success */
}

I am assuming that such values as true', false' have been provided via an enum. I have left the writing of init_crc32_table() to the reader because there are several choices available. The values could be hard coded into that function but they are more likely to be read from a file. The name of that file might be hard-coded or it might be provided by an environment variable or by some other means. The point is that the mechanism by which the table of values is provided has nothing to do with the implementation of the algorithm and so should be kept separate.

Note that the absence of const qualification to the (new) second parameter signifies that it is for writing back a value. These small details make it easier to understand the intent of the programmer.

I think I have space and time to look at one more function from Spencer's code. So here it is. Try reading it first and see if you can follow it. If you cannot, can you explain what the difficulty is? I have some ideas but try to form your own to start with.

/* Pass either argv[0] or a filename as the progname and a char pointer to act
 as the ini_file buffer.  Note: Only parses lines which begin with a character greater than '#' i.e.
an alphanumeric.  Also the list of statements and values are seperated by either '=',' ' or ','
and each entry is seperated from the next by '\0'.  The last entry is designated by "\0\0". */
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include "ini_proc.h"
extern char *ini_buffer;
int ini_proc(char ini_name[]) {
  char *chr_ptr;
  char module_name[]="INI_PROC";
  char inifilename[14] = {'\0'};
  long int file_sz;
  char linein[80];
  char *statement;
  char *value;
  FILE *inifile;
  /* Construct INI filename from program name */
  if(strlen(ini_name) > 14) {
    if(!(chr_ptr = strrchr(ini_name, '\\'))) { /* Get last filename from path */
      fprintf(stdout, "%s: Invalid filename to process.\n", module_name);
      return -4;
    }
    ++chr_ptr;
  } else if(strlen(ini_name) < 2) {
    fprintf(stdout, "%s: Invalid filename to process.\n", module_name);
    return -3;
  } else {
    chr_ptr = ini_name;
  }
  strcpy(inifilename, chr_ptr);
  if(chr_ptr = strtok(inifilename, ".")) {
    chr_ptr = chr_ptr + (strlen(chr_ptr)+1);
  }
  strcat(inifilename, ".ini");
  /* Open and process inifile */
  if((inifile = fopen(inifilename, "r")) == NULL) {
    fprintf(stdout, "%s: Unable to open %s.\n", module_name, inifilename);
    return -1;
  }
  /* Allocate suitably sized block of memory for ini_file contents */
  file_sz = filesize(inifile);
  if(!(ini_buffer=(char *)calloc(file_sz, sizeof(char)))) {
    fprintf(stdout, "%s: Failed to allocate %ld bytes for ini file buffer.\n", module_name, file_sz);
    return -2;
  }
  /* Now read ini file statements into block */
  chr_ptr = ini_buffer;
  while(fgets(linein, 80, inifile) != NULL ) {
    /* if line starts with # or less then ignore */
    if(linein[0] > 35) {
      statement = strupr(strtok(linein, "= ,"));
      value = strtok(NULL, "#;");
      /* add new entry */
      strcpy(chr_ptr, statement);
      /* go beyond the current entry */
      chr_ptr = chr_ptr + (strlen(chr_ptr)+1);
      /* add new entry */
      strcpy(chr_ptr, value);
      while((chr_ptr[(strlen(chr_ptr)-1)] < 33) && (chr_ptr[(strlen(chr_ptr)-1)] > 0)) {
        chr_ptr[(strlen(chr_ptr)-1)] = '\0';
      }
      /* go beyond the current entry */
      chr_ptr = chr_ptr + (strlen(chr_ptr)+1);
    }
    /* add second null for end of statement/value entry */
    *chr_ptr = '\0';
  } ++chr_ptr;
  /* Now set last item entry */
  strcpy(chr_ptr, "lAsTeNtRy");
  chr_ptr = chr_ptr + (strlen(chr_ptr)+1);
  strcpy(chr_ptr, "lAsTeNtRy");
  chr_ptr = chr_ptr + (strlen(chr_ptr)+1);
  fclose(inifile);
  return 0;
}

OK, the first thing that scares me with this code is seeing that extern char * out in global space. At the very least it can do with a better name and a comment telling me what it is for and which file initialises it because as it stands we have a pointer floating around with no idea what it is pointing at.

At this stage I need a justification for a global variable, and doubly so if it is a pointer. This is one of the places where I want to see a comment. I would also like a more informative name for the function and some brief explanation as to what it does. I suspect that it is intended to initialise some process. Fine, why not call it initialise_process(). Even better add something that tells the reader what process is being initialised. I know that strictly conforming code has a very short external identifier length (six characters, case insignificant) but real world compilers are far more generous these days.

Now look at the parameter name, does that suggest a filename possibly proceeded by a path? While you are trying to answer that question notice the return statements with magic numbers. Let us fix that one while we are about it (and note that the exact values are irrelevant, all we want is to be able to distinguish between different failure modes. Putting the following in the relevant header file should help:

enum errors {success, name_too_long, name_missing, file_missing, insufficient_RAM};

Now back to that parameter. There is a comment that talks about last filename on the path. Now that does not make sense. Are we to suppose that the parameter passes a path from which we have to extract the file name that terminates it? In which case what is the significance of 14? (Take a guess, this is a command line parameter which starts with a backslash and refers to a file that is in standard MSDOS 8.3 format and we need a terminator. No that won't work because strlen() will ignore the null terminator. You see what I mean? We need better identifiers and some explanation about what is being done.)

Let us try to reorganise the checks so that they make some sense (note that long list of local variables riddled with uninitialised pointers is worrying me - bound to be a source of errors and maintenance problems). Surely we must first check that the argument passed is not a null-pointer and points to a string long enough to be some use. That gives me:

if (!ini_name) return name_missing;
if(strlen(ini_name<2)) return name_missing;

I think that we should next pull off whatever comes after the last backslash if there is one. That results in something like:

if((chr_ptr = strrchr(ini_name, '\\'))) chr_ptr++;
else chr_ptr = ini_name;

At this stage chr_ptr points to everything after the last backslash, or everything if there is no backslash. Now we must check that we still have a non-null string with:

if(!chr_ptr[0]) return name_missing; /* check first character is not null */

Now reading the code it seems that we must add a .ini extension but look at this piece of code from the above:

  strcpy(inifilename, chr_ptr);
  if(chr_ptr = strtok(inifilename, ".")) {
    chr_ptr = chr_ptr + (strlen(chr_ptr)+1);
  }
  strcat(inifilename, ".ini");

I can see what the first line does, and I am not ready for that yet but the next two lines baffle me. strtok will return the address of the first character after a dot if there is one else it returns null. So if it finds a dot we set chr_ptr to point to the null terminator after that dot and then make no use of it because the next time chr_ptr occurs it is on the left of an assignment. Now I think that the intent is to add the extension .ini if the tentative filename does not have an extension. But as written it will over-write any extension with .ini. The trouble is that we must devise a mechanism that will correctly detect filenames that are too long. The first step is to copy nine characters from where chr_ptr is pointing (ignore the block of code above, it is pointing to the start of the putative filename) into inifilename.

strncpy(inifilename, chr_ptr, 9);
strtok(inifilename, ".");  /* replace the first dot by null if there is one */
if(strlen(inifilename)>8) return name_too_long;
strcat(inifilename, '.ini');

I am not entirely happy with this because it forcibly makes the extension .ini. More reasonable would be to make that the default extension. The simplest way is to replace the last line above by:

strncpy(inifilename, chr_ptr, 13);
if(inifilename[12]) return name_too_long;   /* strncpy fills end of string with nulls */
if (!strrchr(inifilename, '.')) strcat(inifilename, '.ini');

By testing the thirteenth character in inifilename after the second use of strncpy we can ensure that fewer than 13 characters were copied. Actually, I still have not tested any existing extension to see if it is more than three characters. I am leaving that as an exercise for the reader. By the way, at this stage you should not be concerned about code speed because this code only runs once. Even if it is terribly inefficient it will have no impact on the overall performance of your program.

Now we are ready to open the file. I abhor unnecessary complexity in statements so I would write:

inifile = fopen(inifilename, "r");
if(!inifile) return file_missing;

I suppose it is about time I explained why I am not providing error messages. This is the wrong place for them. It is for the user of the function to decide what messages to print and where to print them. However if you really want to compose the messages in this function a slight change will allow you to do so. Change the return type of the function to char const *. Provide a local static array large enough for the longest message. Place the message in that and return its address. For the successful run just return the null pointer. By starting each error message with a unique digit you can even return the nature of the problem as well as an appropriate message. This only one way of solving this kind of problem. The advantage is that more information is provided in the error message, the disadvantage is that you provide more clutter in the local source code and that inevitably leads to mistakes.

Now to another of the errors hiding in this code:

  file_sz = filesize(inifile);
  if(!(ini_buffer=(char *)calloc(file_sz, sizeof(char)))) {
    fprintf(stdout, "%s: Failed to allocate %ld bytes for ini file buffer.\n", module_name, file_sz);
    return -2;
  }

Did you spot that file_sz is a long int? Yes that is what the user provided filesize() returns. But this code seems to be written for an MSDOS system. The first parameter of calloc is size_t and that is usually a typedef for an unsigned int on an MSDOS system. This code runs a severe risk of allocating too little memory if the requisite file is larger than 64k. There clearly needs to be some more work done here. By the way, sizeof(char) is always one, the C standard requires that to be the case. Actually because this is providing memory for a buffer, I would write a separate function to do it and my code would look something like:

if(!allocate_space(ini_buffer, filesize(inifile))) return insufficient_RAM;

It is difficult to comment on the rest of the code because I do not know what the writer is trying to achieve. I must admit that I am worried by the intent to read an entire file into RAM. I feel that there needs to be some kind of validation on the file size. One important thing when writing code in any language is to try to stick to the idioms of the language. For example

chr_ptr = chr_ptr + (strlen(chr_ptr)+1);

Occurs in several places. While it achieves its objective of pointing to the null terminator of your current string I would feel it more natural to write:

chr_ptr += strlen(chr_ptr) + 1;

The failure to use standard idioms is something that makes code difficult to read. The experienced programmer expects to read code and understand it. When faced with non-idiomatic usage the expert is surprised and probably assumes that the writer does not know the language (C in this instance). In natural language, failure to use local idioms reveals that you are an outsider. Outsiders are usually suspect (that is part of our inheritance as social animals). In order to learn idioms you must converse with native speakers. In order to learn program idioms you must study the code of good programmers.

Let me give you a simple example form the above that illustrates the problem of non-idiomatic code. Spencer declares his function with:

int ini_proc(char ini_name[])

Experienced programmers know that this is converted to

int ini_proc(char * ini_name)

by the compiler. Many programmers sensibly use the array format to indicate to users that the address of an array is expected. However the case of arrays of char is different. It would be quite exceptional to pass the address of a single char (so much so that a comment should be made any place where that is intended). On the other hand string literals are semantically (i.e. in their behaviour) read only and so should be const qualified. C originally chose not to require that qualification. C++ (and, in future, C) specifies that the type of a string literal is array of const char which can decay to char * (unqualified). That conversion is deprecated. Now consider:

int ini_proc(char const ini_name[])

Does that convert to:

int ini_proc(char const * ini_name)

int ini_proc(char * const ini_name)

Even if you know which, will those that read your code? By using the array form you make those reading your code think in ways that they would prefer not to. Just write what you mean in the idiomatic form that others will use. In this case that is:

int ini_proc(char  const  *ini_name)

Some readers may think that this is being pedantic. After all there is nothing wrong with what was originally written. That is true. But we also need to be understood. So, over to the readers for their comments.

Notes:

More fields may be available via dynamicdata ..