Title: Student Code Critique

Author:

Date: 06 January 2000 13:15:35 +00:00 or Thu, 06 January 2000 13:15:35 +00:00

Summary:

Body:

from Who?

Please always include your name in attached files. I do not have time to check and when I do they have long been detached from the email, or disk in which they were sent.

We have a program written by a student to fulfil the task described below. I am a student too, and I haven't had formal training in programming. Therefore I would be interested if someone could write a critique of my critique.

Please do. The purpose of this column is to provide positive help to less experienced members while giving everyone a chance to add their two cents worth.

This is our specification:

You must read in a file whose first item is the number of remaining items.
The file format seems to be that each item is separated by a space.
When you have read the file, you must create a list of the unique items.

Example input:

  10 swe dan dan nor swe swe tur swe pol dan

Example output:

  swe dan nor tur pol

The order of the items in the output is not important

3 changes have been made to the original to make the code compile on an ANSI C compiler:

   1 #include <stdio.h>
   2 #include <string.h>
   3 #include <stdlib.h>
   4 #define MAXLAND 10
   5 typedef struct { char namn[4];} test;
   6 int main(void){
   7   FILE* fil;
   8   test Ttest[MAXLAND], Ttemp[MAXLAND];
   9   int k=0,counter=0,ej_lika,m,stycken,a;
  10   fil=fopen("land.dat", "r");
  11   if (!fil){
  12     printf("ERROR open file!");
  13     exit(1);
  14   }
  15   fscanf(fil,"%d",&stycken);
  16   while (!feof(fil)){
  17     fscanf(fil,"%s",&Ttest[k].namn);
  18     k++;
  19   }
  20   fclose(fil);
  21   for (k=0;k<=MAXLAND;k++)Ttemp[k].namn==NULL;
  22 /* Can i do this different??*/
  23   for(k=0;k<=stycken-1;k++){
  24     ej_lika=0;
  25     for(m=0;m<=stycken-1;m++){
  26       if(strcmp(Ttest[k].namn,Ttemp[m].namn)!=0) ej_lika++;
  27       if(ej_lika==stycken){
  28         strcpy(Ttemp[counter].namn,Ttest[k].namn);
  29         ej_lika=0;
  30         counter++;
  31         break;
  32       }
  33     }
  34   }
  35 /* check the result. */
  36   for (k=0; k<counter; k++) printf("%s\n",Ttemp[k].namn);
  37   getchar(); return 0;
  38 }

Lines 1-3 have been added (for clarity the original omitted the #includes)

Line 22 originally read

    /' Can i do this different??*/

the ' has become a *

Line 37 originally read

    getch(); return 0;

I have used getchar instead so that <conio.h> need not be #included (this is a non-ANSI header file)

My first impression is that this code was not written by a native English speaker. The variable names ej_lika and stycken have a distinctly foreign feel. Also, the example data looks like it might refer to the names of predominantly Scandinavian countries. We should bear this in mind when discussing the student's choice of variable names.

We notice also that the program seems to have been a rush job. The original line 22 is a good demonstration of this. The program would not even compile (this shows a lack of testing too) and the spelling is dubious, with the word 'I' written in lowercase.

The student seems reluctant to use spaces in the program. One reason for this might be that they tried to type the whole thing out as fast as possible^[1]. The dangers of this approach are obvious and need not be discussed. Another reason may be that the student wants the code to be more compact.

These days, a good program is a clear program. Computers are sufficiently powerful that it is not necessary to squeeze every last millisecond out of a program, nor is it necessary to save every last bit. The argument is especially true for compiled code. All too often one sees a convoluted statement in C with hundreds of assignments, comparisons, conditional operators, and so on. Such statements are very hard to follow, prone to errors, and usually produce exactly the same machine code when compiled as do longer, clearer, statements.

Let's go through the code line by line.

In line 4 the student has set up a macro for the maximum number of items in the input file. It is good to see that some attempt has been made to have a name in the code rather than a magic number. Also the macro name is all in uppercase which follows the convention. However, there are a number of problems here. Firstly, it is usually better to avoid macros. I would have preferred to see something like

  const unsigned int maxland=10;

  enum { MAXLAND=10 };

instead^[2]. Macros modify the source itself before it is compiled which is possibly best avoided. However, in C especially, #defined constants are not actually wrong.

This line shows a misunderstanding on the part of the student. Their program will handle the example file, but little else. It doesn't actually meet the specification, which places no limits on the number of items. The name MAXLAND shows that the student is thinking in terms of the countries in the example file, and not in terms of general data items. For the example file, dynamic memory allocation seems a bit heavy-handed, but the specification allows for files of arbitrary length. A model answer would allocate memory dynamically.

Line 5 defines a type used to store the incoming data items. Again we see that a fixed-length array is used - and a small one at that. My personal style would be to write something like:

  char namn[3+1];

to indicate that namn will be a string. However this is a matter of taste.

Using a structure in this way seems a bit odd - why not just use a multi-dimensional array to store the data? The choice of identifiers is peculiar too. Why namn instead of name? What has test got to do with anything? I like to define my type identifiers with an initial capital and use CamelNotation and my variables lowercase_underscored in order to distinguish between them. This is often useful when the same word seems natural for both a type and variable: e.g. Date date^[3]. Really though what particular style one uses is not so important as applying a style consistently - and I cannot see much evidence of this in the code.

The void in line 6 is superfluous, but not actually harmful. I prefer the style where the brace comes on its own on the next line, but there is nothing wrong with the other way^[4].

I have some reservations about the use of fil as a name on line 7. Typing 'file' seems more natural to me. Perhaps better still would be a name like 'input' or 'in' that gives a clue about what the file is. However, in a short and simple program this is less important. The variables declared on line 9 will be discussed when they are used in the code.

Line 11 is OK, but the use of ! suggests that fil is some sort of flag or boolean value. I would prefer

  if (fil==NULL){

because that is really what we are testing for^[5]. They compile to pretty much the same code anyway.

Errors should really go to stderr, not stdout, so the use of a simple printf in line 12 is questionable. Further, puts is a better function to use when one is outputting a plain string because it is simpler. One should always use the simplest means to get the job done. The printf class of functions are capable of failing and returning errors, which in an ideal world should be taken into account.

However, because our next action is to exit we don't need to bother here. We wouldn't do anything different if an error did occur. Another gripe might be that the error message that is produced is not very elegant. I shall put this down to the fact that the student may not be overly familiar with English.

Line 14 raises an interesting point. Ordinarily one should avoid the use of fscanf because it is awkward to control how much input will be read. However, in this particular case it should be all right because we a loading a value to an integer, not a string or block of memory. What is more worrying is that although the program stores the number of data items read, it does not use this in the subsequent input routine. Specifically, if there are more than 'MAXLAND' data items, we can start overwriting bits of memory that we don't own.

The while-loop on lines 16-19 is perhaps not the best approach to the problem. A for-loop might conform more closely to the C idioms. If a while must be used, I would like to see k set to 0 just before it. This would make it more obvious to the reader that k had been initialised, and if some code using k was inserted before the while-loop, the loop would still work. The above made point about fscanf applies here - fgets might have been better, or some such other custom input routine. My earlier point about a multi-dimensional array being more appropriate than a struct is re-enforced: the clumsy Ttest[k].namn could be replaced with a simple Ttest[k]. Lastly, it might not be a bad idea to check the fil-stream for errors as well as end-of-file. I am afraid that I am not sure if an error would set the eof flag or not.

Line 21 could be replaced with

 k=MAXLAND+1;

which would be equally pointless. Perhaps the aim was to initialise all of the elements of Ttemp? Such a move is not necessary as the stycken variable is later used to determine how much input was read, and the counter variable records how many elements of Ttemp have been used.

At this stage we have read in the input. The identification of unique items is a separate task, and should therefore be in a separate function. Recently I was working with someone who had just taken a vocational course in programming, and I was horrified to discover that they had not been taught about functions. Perhaps the student here is in the same boat. At the very least, a blank line and perhaps a (meaningful) comment should separate the two phases.

The names ej_lika and stycken did not convey any meaning to me at all. This made it surprisingly hard to figure out what algorithm the student had used to identify unique items. Also using k and then m as simple loop counters is a bit unconventional - i and j would have been more natural. Both in line 23 and line 25 the clumsy

 <=stycken-1

should have been

  <stycken

which is both clearer and less computationally expensive.

Essentially what the student does is take each loaded data item in turn, and compare it with all of the elements stored in the Ttemp array. If no matches are found, the loaded item is copied into the Ttemp array. The obvious flaw in this is that Ttest is often compared with uninitialised elements of Ttemp. Usually this will work, but there is a small chance that the junk in Ttemp might correspond to an item read in from the file. Further, this approach makes many more tests than are necessary: even if there is only 1 item in Ttemp, 10 comparisons are still made.

The if-statement on lines 27-32 should be moved outside of the for-loop starting on line 25. The condition it tests for will only occur at the end of the loop, so performing the test many times inside the loop is just wasted time. Lines 29 and 31 are altogether superfluous.

What would be a better way of approaching the problem? Searching through each item like the student has works fairly well for small amounts of data, but the specification allows for arbitrarily long input. Another approach - which has the advantage of using much less memory - might be to sort the input and then step through it, printing any item that was different from the previous one. For very large input indeed, some sort of hash table would probably be best because looking up data is almost an O(1) operation. I have decided to use a hash table in my model answer. This is quite a complicated solution to a seemingly simple problem.

However, when nuts can come in arbitrary size, sometimes cracking them with a sledgehammer is appropriate. Further, although writing hash-handling functions and input routines will make this project much longer than it could be, these routines can be re-used in later projects - actually saving time in the long run.

Actual experience suggests this will not be the case. Before investing time in a more general solution you should have a reasonable expectation that you will need such. Writing reusable code is expensive so do not do it on spec.

Critique from James Holland

On compiling the student's program, two warnings are issued. The first warning states that the code Ttemp[k].namn == NULL; has no effect, the second warning states that a is declared but never used.

The first warning results because the student has used the wrong operator. The == operator is used to compare two values. As written, the compiler compares Ttemp[k].namn with NULL and simply throws the result away. What is required is the assignment operator, =. I am sure the student knows this and that the error was just a slip. The second warning is resolved by simply deleting the declaration of a.

Make sure your compiler has all warning options switched on. Do not ignore warnings as they provide clues as to faulty or poor quality code.

When the program is recompiled an error is issued. We seem to be going from bad to worse. But this is not so as a more fundamental problem has been revealed. The compiler states that an lvalue is required for the statement Ttemp[k].namn = NULL;. An lvalue is a storage location, Ttemp[k].namn is not a storage location, it is the address of a storage location. There are two ways to correct this, one is to fully specify the location we wish to NULL, e.g., Ttemp[k].namn[0] = NULL;, or by dereferencing the address of the location we wish to NULL, e.g., *Ttemp[k].namn = NULL;. Either statements will do, they both have the same effect and, more than likely, generate exactly the same code. Ttemp[k].namn[0] = NULL; is probably more straight forward and easier to understand^[6].

Let us recap on what we have been trying to do. We are trying to NULL C strings. Each element of the Ttemp array contains a string called namn. C strings are just arrays of chars, so in order to NULL namn it is only necessary to NULL the first element of namn. The other three elements of the char string namn can be left as they are. There is no need to use the macro NULL, a simple 0 will do^[7].

We now have a program that compiles with no errors or warnings. There are, however, some other matters we should attend to before running the program. The for-loop that NULLs the char array of Ttemp iterates once too often and will write beyond the end of the array. The for loop should be for(k = 0; k<MAXLAND; k++).

We have now sorted out the for-loop for nulling the Ttemp array. But it turns out that the for-loop is not necessary after all. All we need to do is initialise the Ttemp array in the declaration statement, e.g., test Ttemp[MAXLAND] = {{{0}}};. But why do we need all these braces? In short, it is because we are initialising an array of structs containing an array of chars. We have only provided the initial value for the first element of the char array namn within the first element of array Ttemp. The compiler will set the remaining elements of an array to zero when a partial initialisation list is provided. We could leave out two pairs of braces but the compiler is likely to issue a warning about the initialisation be partially bracketed. It is best not to give the compiler the excuse to complain. The program can now be run. It seems to provide the desired results. There are, however, some further improvements that can be made.

The code for finding duplicated strings seems to work perfectly well. It does, however, seem too complicated for what it is doing. The conditional expressions of the two for loops can be changed to k < stycken and m < stycken respectively, although this is only a minor point.

The inner loop continues to scan to the end of Ttemp irrespective of having found a match. The break statement has no effect, as the inner loop will exit when ej_lika equals stycken. The second if-statement is evaluated every iteration of the inner loop but the body of the if-statement is only executed when the loop is about to exit. This is not very efficient. With some rearrangement we can improve things somewhat.

First, we can arrange for the inner loop to exit as soon as a match is found, but before it does it sets a flag to indicate a match. Then, outside the inner loop but within the outer loop we can decide whether we need to copy the string in question to the Ttemp array. Note that the modified code has a break statement within the inner loop but this time it is used to exit the for-loop prematurely, if a match is found. These modifications will not make a great improvement to the efficiency of the program but they will make it easier to understand. The program now works in a more natural way, at least to my way of thinking.

There is a redundant & (address of) in the statement fscanf(fil, "%s", &Ttest[k].namn);. The & is best removed as it can be confusing, Ttest[k].namn is already an address.

There are some general points to be noted regarding the student's program. There is no protection against loading from file more country names than the program can accommodate. The array Ttest will overflow if more than ten names are read from the file. Also, if the length of a name is greater than three characters the char array namn will overflow.

There is no check to confirm that the first number read from the file is the same as the number of names in the file. It would be better not to have the number of countries stored in the file at all. The number of countries could be calculated as they are read from the file. The arrays would still have to be large enough for the expected number of countries, though. In fact, the whole file does not need to be held in memory. The country names can be compared with the names in Ttemp as they are read from file. There would be no need for the Ttest array.

Also, there is no need to have structs that contain just one member element. It would be simpler to dispense with the structure and declare Ttest as an array of char arrays, namely, char Ttest[10][4];. The address of a particular string would then be simply Ttest[k] as opposed to Ttest[k].namn.

I present a corrected version of the student's program in file student\land.c.

My version for the program is based on similar lines as the student's . I have tried to make it somewhat more robust and to give more information to the user when things go wrong.

The program starts by attempting to open the data file. A warning message is issued if there is a failure to open the file properly. If all goes well the main loop of the program is then entered.

The fscanf function limits the number of characters read to three. This ensures that the char arrays, country and output_list, do not overflow.

I have also guarded against trying to write too many strings to the output_list array by ensuring the number of different file names does not exceed the size of the array storing them. If the array is full and a name needs to be added, a flag (resources_exhausted) is set and no attempt is made to add the name to the array. The while loop will then exit and an appropriate message warning displayed. The while loop is also terminated when the end of the data file is reached.

I have combined the code that reads the strings from the file and the code that removes duplicated strings into a common loop. This has the advantage that it removes the need to store the entire file in memory. The country names are compared with those in the output list as soon as they are read from the file. There is no separate loop to display the unique file names, this code is combined into the main while loop as well.

Finally, the data file is closed and the resource exhaustion message displayed, if required.

I provide my solution to the problem in file mine\land.c. I have included more comments than I would normally in the hope that it will assist the novice.

Both files of code will be placed on the ACCU web site.

From Mike O'Shea `<mike.oshea@tinyonline.co.uk>`

I am not a professional programmer (in the sense that I am not paid to program) and am largely self taught. That means my knowledge is rather limited, and that my code is probably very suitable for criticism. I feel I have learnt a lot from carrying out this exercise, and it is the first time I have done anything like this^[8].

When I looked at the code, I came up with the following observations:

I would have defined the file name as a constant string at the beginning of file so it could be changed easily, or you could allow for it to be entered either on the command line or at a prompt depending on user requirements.
The variable a is not used.
Most of the variable names don't give any clue to their purpose. I did not understand the ones that could have had a meaning (ej_lika & stycken)^[9].
I would wrap the 'open file command' directly in the if-statement, to simplify the code.
I would not use fscanf as it behaviour can be a bit unpredictable^[10].
I would use a linked list, rather than the two arrays of structures. This does not require you to know how many words you are dealing with and enables you to sort the list as you create it.
The code that is supposed to zero the Ttemp array does not, it tests for equality and does nothing with the result, a single '=' would have worked.
strcmp is case sensitive, which is all right if you want your results to reflect the case of the words.
The getch() at the end requires a "Press anykey to continue" message, which I would send to standard error, in case the output is redirected to a file.
I would use fprintf rather than printf and send the file error messages to standard error and the normal output from the file to standard out, so it can be easily redirected to a file.

I have rewritten and refined the program, the specification set down in the excise did not give you much information about what the program should do. Are the elements read from the file all three characters in length? Is the file a fixed width format file, or what is the delimiter? Was the number supposed to be on a separate line or was it just word-wrapped? What is the potential size of the file, how many elements could there be in it? How is the file created, (human or machine)? How dependable is its format, what kind of errors do we need to trap? Is the programme run as part of a batch process (unsupervised) or run from the command line.

My refinements included allowing you to specify a file name on the command line and adding a prompt for a file name if you don't enter it on the command line. From the information given in the specification I could see no reason to bother with the number of records at the beginning of the file. I have included the code to read the number, but do not use it within the program. I chose to make any non-alphabetic characters the delimiters (after I have read the numbers). My program does not need to know the number of words in the list, as a linked list is dynamic and allows for easier further manipulation, such as sorting the list. I chose to convert all the characters to lower case before I carried out any comparison and added to the list. For the sake of the exercise, I included a count of the word frequency in the structure, and output the data both to the screen and to a file. If you run it on a text file it does not cater for punctuation, such as apostrophes, but presumably the program given in the example deals with identifiers rather than words used in sentences.

I would appreciate any comments you or anyone else has on my code.

You will find Mike's code on our website (www.accu.org). I have not printed it because it is fairly long. However here are a few comments to consider:

Restrict identifiers without any lowercase letters to the pre-processor. I.e. do not use all uppercase for typedef identifiers (e.g. NODE in your code)
Think very carefully before using global variables (in this case the beginning and end of the linked list
Do not write a monolithic main but provide each piece of functionality as a distinct function (e.g., getting the file name, management of the linked list - note that you use malloc but just rely on program termination to release the resources.
As you will probably reuse such things as linked lists, write them once as a separate exercise.

And The Winner

This time I mentally tossed a coin (weighted because of the anonymity of one contributor and came up with James Holland. If James likes to contact me we can discuss what he would like for his reward.

Title: Student Code Critique

from Who?

Critique from James Holland

From Mike O'Shea `<mike.oshea@tinyonline.co.uk>`

And The Winner

Student Code Critique Competition

Sponsored by Blackwell's Bookshops & Addison-Wesley

Title: Student Code Critique

from Who?

Critique from James Holland

From Mike O'Shea <mike.oshea@tinyonline.co.uk>

And The Winner

Student Code Critique Competition

Sponsored by Blackwell's Bookshops & Addison-Wesley

From Mike O'Shea `<mike.oshea@tinyonline.co.uk>`