Title: Questions and Answers

Author:

Date: Tue, 05 December 2000 13:15:41 +00:00

Summary:

Body:

Answers

Question 6 in C Vu 12.5 created a considerable response, with a couple of long and very detailed answers. The result is almost an article in itself. Fortunately I do not have to follow traditional periodical rules, where a fixed number of pages is allocated to regular columns, and so can publish a range of alternative answers.

Answers (tentative) to Questions 1-4

from: Silas S. Brown <ssb22@cam.ac.uk>

I will try to answer Jun Woong's questions in C Vu, but I am not entirely sure about my answers.

1) If the declaration of an identifier for an object has file scope and no storage-class specifier, its linkage is external.

Yes, except that the "declaration" will then be a definition. For example:

file1.c:

int this_is_global = 3; /* definition at file scope and
      no storage-class specifier (static, extern, etc) */

file2.c:

#include <stdio.h>
extern int this_is_global;
int main() { printf("%d\n",this_is_global); return 0; }

Output: 3

2) The declaration of an identifier for a function that has block scope shall have no explicit storage-class specifier other than extern?

3) Why does the standard disallow storage-class specifiers other than extern on a function declaration inside a block?

The other storage-class specifiers (e.g. static) would not make sense. For example:

int my_function() {
    int your_function();
    return your_function();
}

Here, your_function is declared at block scope (within my_function), but it must be defined elsewhere. The declaration cannot be static, since the function cannot be defined within the current scope (which is my_function).

4) Regarding arguments that are subject to macro expansion in #pragma and #error: "C: A Reference Manual" (H&S) says that: "The argument to #pragma is subject to macro expansion. The #error directive produces a compile-time error message that will include the argument tokens, which are subject to macro expansion." Please give me examples of these. I would like to know how the #error directive can have arguments subject to macro expansion.

I don't think this is true. I don't have the standard, but none of the compilers I tried expanded the arguments of the #error directives. The code I tried was:

#define shi4yan4 experiment
#error shi4yan4

The compilers gave an error message like:

test.c:2: #error shi4yan4

If it had done macro expansion, it should have said "#error experiment".

Answers to Question 5

from James Curran <James@NovelTheory.com>

Question #5: What is the correct behaviour of

int x =10;
for (int y=0, x=0; y < 2; x++; y++) {}

Answer:

First of all, as published, that code is a syntax error, as there can only be two semi-colons in the for statement. The correct version would be:

   for (int y=0, x=0; y < 2; x++, y++) {}

which uses the comma operator to join the two parts into a single expression.

To best understand the behaviour of a for statement, you can always treat a statement like "for(A;B;C)..." as if it were written

    {
        A;
        while(B)
        {...
            C;
        }
    }

Substituting your code into that format, we get:

    int    x = 10;
    {
        int y=0, x=0;
        while (y<2)
        {
            ....
            x++, y++;
        }
    }

From that, it should be obvious that x is being defined as well as initialised inside the for statement, so it is a new variable that hides the outer x, which keeps its value of 10.
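The shadowing can be checked with a short sketch. It assumes a C99 (or C++) compiler, since declarations inside a for statement are not valid in C89, and the function name outer_x_after_loop is purely illustrative.

```c
#include <stdio.h>

/* Hypothetical helper: returns the value of the outer x after the loop. */
int outer_x_after_loop(void)
{
    int x = 10;                       /* outer x */
    int inner_total = 0;

    /* "int y = 0, x = 0" is a single declaration: it creates a NEW x that
       hides the outer one for the duration of the for statement. */
    for (int y = 0, x = 0; y < 2; x++, y++) {
        inner_total += x;             /* uses the inner x: 0, then 1 */
    }

    printf("inner total = %d, outer x = %d\n", inner_total, x);
    return x;                         /* the outer x was never touched */
}
```

Many compilers will warn about the hidden variable if asked to (for example with a shadowing warning enabled), which is a cheap way of catching this by accident.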

[I think there is still a bit more to say on this subject. FG]

from Catriona O'Connell <catriona38@hotmail.com>

In C Vu 12.5, Dave Midgley (p17) asks about the scope of x in his two examples. The C++ standard in Section 6.5.3 answers this question and shows that his compiler is conforming. His explanation is correct.

Confusion might arise because of the distinction between the two uses of the comma: as a separator in a declaration and as the comma operator. What actually happens is that the int y makes the whole of the first clause a declaration, thus creating both y and x with local scope, rather than creating y with local scope and then, as an independent action, setting the x declared outside the for statement to 0.

If you try to compile Dave's first example with MS Visual C++™ 6.0 (even at SP4) it will fail with a C2374 error code, because the compiler moves the initialisation outside the for-loop, causing a redeclaration. The "solutions" offered by Microsoft are:

1. Compile with /Za - breaking most Windows code.

2. Define the following macro:

#define for if(0);else for

The confusion exemplifies one of my programming rules of thumb; that it is unwise to use the same variable name in different scopes within a single module. Doing so introduces a potential lack of clarity and a maintenance overhead. While scopes are clear in Dave's examples, in more complex code (spread over several pages) it might be less clear which instance is in scope.

Answers to Question 6

I think the first answer reviews the question in sufficient detail so I will not repeat the question here.

Answer from R.Butler

The Questioner has a file of statistical data in columnar format, and wishes to count the number of values on the first line of the file, assuming this to be typical, and hence determine the number of columns. The Questioner has already discovered the difficulties involved in trying to use fscanf() to do this. These problems arise because that function does not distinguish between the end-of-line character and other "white space" characters. It is therefore difficult to make it read one line from a file and then stop. For example, given a file containing only columns of integers, successive calls of fscanf(fp, "%d %d", &xValue, &yValue) will happily keep reading and assigning values until the file is exhausted. Incidentally, notice that &xValue and &yValue must be addresses, indicated by '&'.

The Questioner is confident about what will happen if you give the aforementioned function a 3-column file; nevertheless it might be worth considering an example. Given a file like this:

1 2 3
4 5 6
7 8 9

successive calls of fscanf(fp, "%d %d", &xValue, &yValue) will assign values to xValue and yValue thus:

xValue yValue
   1      2
   3      4
   5      6
   7      8
   9      ?

The value of the last yValue will depend on whether or not it has been assigned another value since the previous call to fscanf(). If not, it will be 8 again.
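The table above can be reproduced with a small sketch. It writes the three-line example to a temporary file (via tmpfile(), so it assumes the platform allows one to be created) and reads it back in pairs. The function name read_pairs and its last_result parameter are illustrative only.

```c
#include <stdio.h>

/* Hypothetical demonstration: writes the 3x3 table to a temporary file,
   then reads it back two values at a time. Returns the number of
   fscanf() calls that assigned at least one value, or -1 if the
   temporary file could not be created. *last_result receives the
   return value of the final such call. */
int read_pairs(int *last_result)
{
    FILE *fp = tmpfile();
    int xValue = 0, yValue = 0, calls = 0, result;

    if (fp == NULL) return -1;
    fputs("1 2 3\n4 5 6\n7 8 9\n", fp);
    rewind(fp);

    while ((result = fscanf(fp, "%d %d", &xValue, &yValue)) > 0) {
        /* result == 2: both assigned; result == 1: only xValue assigned,
           so yValue still holds whatever it held before the call */
        printf("result=%d xValue=%d yValue=%d\n", result, xValue, yValue);
        *last_result = result;
        calls++;
    }
    fclose(fp);
    return calls;
}
```

The return value of fscanf() is the only reliable sign that the final call filled in xValue but not yValue.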

Returning to the original problem, my solution would be to read the first line into a character array using fgets(), which does stop when it reaches the end of the line, and then pass a pointer to this array to a function which does the following:-

Initialise a counter to 0.

Start at the beginning of the array.

Repeat until reaching the end of the array:-

Step over any spaces until encountering something which is not a space, or the end of the array;
If it's not the end of the array, it must be a column, so add 1 to the counter;
Step over the subsequent characters until encountering a space, or the end of the array.

On reaching the end of the array, return the value of the counter.

This function could be coded as follows:

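#include <ctype.h>
/* Prototypes, so that count_columns() can call the helpers defined below */
char *step_spaces(char *p);
char *step_non_spaces(char *p);
int count_columns(char *line) {
    int    columns = 0;
    char *p = line;
    while (*p != '\0') {
        p = step_spaces(p);
        if (*p != '\0') columns++;
        p = step_non_spaces(p);
    }
    return columns;
}
/* The casts are needed because isspace() expects an unsigned char value */
char *step_spaces(char *p) {
    while(*p != '\0' && isspace((unsigned char)*p)) p++;
    return p;
}
char *step_non_spaces(char *p) {
    while(*p != '\0' && !isspace((unsigned char)*p)) p++;
    return p;
}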

The scanf() family of functions are not alone in regarding the end-of-line character as a sort of space. The isspace() function, which I use in my suggested solution, also has this characteristic. That is why the function steps over space characters first. Doing so ensures that it returns the right number of columns (zero) if the line contains nothing but an end-of-line character.

Answer from Graham Patterson

The question concerned extending a basic fscanf(fp, "%f %f", &arg1, &arg2) function call to handle a generalised case of a multiple column file of floating point values (since the question mentioned statistical data, although the example contained integers, I am assuming that floating point will be used for the purpose of example). Since the function cited is from the C library, I am using C as the programming language.

There are a number of problems with the formatted input functions in the C Standard Library. Some of these problems are to do with their implementation, and some are inherent in the data storage model used by the language.

Most authorities would advise using fscanf() with caution, if at all. The main problem with it is that it consumes its input during the decoding process, which makes it impossible to re-try the conversion without returning to the start of the file. Issuing a rewind() or fseek() is possible on a file, but may not succeed on a pipe. Even if it is possible, the overhead of file system interaction may be too much for the application. The alternative is to grab a line at a time (ignoring for the moment what constitutes a 'line'), and then use sscanf() or some other tokenising scheme to decode the line.

Where the input consists of tabulated data it is possible to use sscanf() to determine a suitable format string. One method (example code provided) is to start with an atomic conversion specifier, such as %*lf, which will parse a double value without assignment. We can take advantage of the default white-space separation of numeric input in this example. A working format string can be built by concatenating such specifiers and testing the return of sscanf() with the line under investigation. A -1 return indicates that the last specifier is either one too many for the data, or is of unsuitable type. There is an alternative approach using the %n conversion specifier that is discussed later in this article.

It is thus possible to create a custom format string with a reasonable knowledge of the input data (general structure and data content). The number of fields involved is now known. We obviously need to watch for buffer overflow both in our input line and our dynamic format string. The format string we construct may be up to four times as long as the input string, since every field adds a four-character specifier such as %*lf. The input must not only fit within the allocated buffer, but we also need to consider the possibility of reading only a partial line. The library function fgets() conveniently includes the newline character if it occurs before the buffer length is exceeded. This is a useful sentinel for partial line input: if the buffer contains no newline, the line was truncated. If such an event occurs the input buffer must be inspected and adjusted, because the last valid conversion may not have read the entire token. E.g. '123\0' is valid as 123.0, even if the file actually contained '1234.567\n'. It is rare to need a program that can read arbitrarily long input lines, but limitations in this respect must be documented.
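A minimal sketch of the newline check might look like this. The helper name read_line and its return codes are assumptions made for illustration; note that a final line with no trailing newline is reported in the same way as a truncated one.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: reads one line into buf (size bytes).
   Returns  1 if a complete line, newline included, was read,
            0 if only part of a line fitted (or the last line of the
              file has no newline),
           -1 on end of file or read error. */
int read_line(FILE *fp, char *buf, size_t size)
{
    if (fgets(buf, (int)size, fp) == NULL) return -1;
    /* fgets() keeps the newline only when the whole line fitted. */
    return strchr(buf, '\n') != NULL ? 1 : 0;
}
```

With a four-byte buffer and a file containing '1234.567\n', the first call returns 0 and leaves "123" in the buffer, which is exactly the misleading partial token described above.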

So at this stage we have a reasonable format string to decode our line, and it would work if we removed the assignment suppression characters. Since C is a compiled language, it expects that we know the number of arguments to the function when we write it - we do not have the run-time evaluation options of PERL or PHP, for example. So if we had data in columns that could range from 2 to many more, how do we write our input?

How about:

int result = sscanf(buffer, format_string, &double_arg1, &double_arg2, &double_arg3);

This would work for up to three arguments, but would break for four or more. If the problem was finite we could code one line with enough arguments for the worst case. And then along comes a file that is bigger!

The answer is to read one value at a time. We know how many to read from the previous work. We should also know what sequence of types we require. This decomposes to the basic construct:

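int offset = 0;
for(i = 0; i < number_of_arguments; i++) {
   int consumed = 0;
   result = sscanf(buffer + offset, "%lf%n", &double_args[i], &consumed);
   /* error checks, other processing */
   offset += consumed;   /* %n is relative to buffer + offset */
}

The offset is derived via the %n format specifier, which records how many characters of the input the current call has consumed so far. It has to come after the data conversion specifier to record the consumed characters, and because each call starts at buffer + offset the count must be added to the running offset rather than assigned to it.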
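As a self-contained sketch of this loop, the function below stops at the first failed conversion, so it also counts the fields on the line. The name parse_doubles and the max_values limit are assumptions for illustration.

```c
#include <stdio.h>

/* Hypothetical helper: stores up to max_values doubles from line into
   values[] and returns how many were converted. */
int parse_doubles(const char *line, double *values, int max_values)
{
    int offset = 0;
    int count = 0;

    while (count < max_values) {
        int consumed = 0;
        /* %n counts the characters consumed by THIS call only, so it is
           added to the running offset rather than replacing it. */
        if (sscanf(line + offset, "%lf%n", &values[count], &consumed) != 1)
            break;
        offset += consumed;
        count++;
    }
    return count;
}
```

A caller might pass the buffer filled by fgets() and compare the result with the column count taken from the first line.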

We could have used a similar construct when determining the format string. However, building the string automatically keeps a record of the process. There are times when it would pay to use the format string as a base for the conversion. Typically you may have to do this if %n is not available as a conversion specifier, which is the case with some older compiler libraries. It might also be a good choice if you have a variety of data types in the input, since you build a control string while you investigate the content and number of fields in the input file. In this case the format string would be duplicated for each pass through the loop, less one assignment suppression character ('*'). By walking the actual assignment along the format string we can decode the input buffer one value at a time.
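Walking the assignment along the format string could be sketched as follows. This is a simplified variant: it builds the format only as far as the wanted field, instead of copying the whole string and removing one '*'. The names read_field and MAX_FIELDS are hypothetical.

```c
#include <stdio.h>
#include <string.h>

#define MAX_FIELDS 64   /* assumed upper bound for this sketch */

/* Hypothetical helper: reads field number index (0-based) from a line
   expected to hold fields doubles. Returns 1 on success, 0 otherwise. */
int read_field(const char *line, int fields, int index, double *out)
{
    char format[MAX_FIELDS * 4 + 1];
    int i;

    if (fields > MAX_FIELDS || index < 0 || index >= fields) return 0;

    format[0] = '\0';
    for (i = 0; i <= index; i++) {
        /* suppress assignment for every field except the wanted one */
        strcat(format, i == index ? "%lf" : "%*lf");
    }
    return sscanf(line, format, out) == 1;
}
```

Each call rescans the line from the start, so this is slower than the %n approach, but it needs nothing beyond the basic conversion specifiers.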

There is no best answer to the general problem. A lot depends on how variable the input data may be in a particular application.

This example reads a row of data into an array. In many cases we are going to have data in row === record, column === variable format. C uses a very low-level implementation of arrays that does not support dynamic re-sizing. We can re-allocate the memory occupied by an array (providing it was dynamically allocated originally via a malloc() / calloc() / realloc() call). What we cannot do is have an array,

double array[10];

and automatically extend it with something like:

int more = 12;
array[more]; /* will not work - array bounds exceeded */

So the input decoding is only the first part of the problem. To read in a table of numbers, we would probably allocate an array of pointers to arrays of double, dimensioned by the number of columns in the data. Each column array will have to be dimensioned by the number of rows in the data. If we know (or can pre-scan the input to find out) the number of rows, it is easy. If we have to do it during the read process the algorithm becomes one of allocating an initial space, and then re-sizing upwards when the space is used. Unless we know how many lines are in the file in advance, we can expect to become intimately acquainted with the realloc() function!
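One common way of re-sizing upwards is to double the capacity each time the space runs out. The sketch below does this for a single column; the Column type, the column_append name and the starting capacity of 16 are all assumptions for illustration, not part of the listing that follows.

```c
#include <stdlib.h>

/* Hypothetical growable array of double for one column of the table. */
typedef struct {
    double *data;
    size_t  used;
    size_t  capacity;
} Column;

/* Appends value, doubling the capacity when the array is full.
   Returns 1 on success, 0 if memory could not be obtained (the
   existing data is left intact in that case). */
int column_append(Column *col, double value)
{
    if (col->used == col->capacity) {
        size_t new_capacity = col->capacity ? col->capacity * 2 : 16;
        double *grown = realloc(col->data, new_capacity * sizeof *grown);
        if (grown == NULL) return 0;   /* col->data is still valid */
        col->data = grown;
        col->capacity = new_capacity;
    }
    col->data[col->used++] = value;
    return 1;
}
```

A column starts out as Column c = {NULL, 0, 0}; and is released with free(c.data). Assigning the result of realloc() to a temporary first avoids losing the original block when the call fails.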

If we use another language we may avoid a lot of this processing overhead. The C++ STL can handle the indeterminate data size more easily than hand-coding our own memory management routines in C. For that matter the dynamic array handling of PERL and its field separation facilities make this sort of data parsing comparatively trivial. But these may not be the best languages for the processing of the data.

If I was just interested in re-ordering or formatting data of this type I would consider AWK or PERL. If I was looking to do extensive mathematical processing I would look to a compiled language, possibly with extensive large number and numeric function support.

/* module title  : fscanfex.c
 * author        : Graham Patterson (G.A.Patterson@btinternet.com)
 * revision history : [Editorial note: for reasons of space, I snipped this.]
 * problems      : Demonstration of concept. Not production code.
 * description   : Demonstrates a method for determining the field composition
 *                 of a numeric data file with a view to constructing a scan
 *                 format string. */
#include <stdio.h>
#include <string.h>
#ifndef MAX_LEN
#define MAX_LEN 1024
#endif
int double_fields(const char *buffer) {
    char decode_format[MAX_LEN * 4 + 5];
/* Worst case - 4 format characters per input character, plus the final
 * specifier appended by the last pass of the loop and the terminating null */
    const char *format = "%*lf";
    int conversions = 0;      int result = 0;
    decode_format[0] = '\0';
    do     {
        strcat(decode_format, format);
        result = sscanf(buffer, decode_format);
        printf("Testing %s with %s, obtained %d\n", buffer, decode_format, result);
/* With assignment suppression we don't get the number of conversions, 
 * so we count them ourselves. */
        if(result != -1) conversions++;
    }
    while(result != -1 && strlen(decode_format) <= (MAX_LEN * 4));
    return conversions;    
}
/* It is assumed that this function is only called when a field is expected;
 * however, an error flag or exception is still advisable! Setting an invalid
 * offset works, but has some stylistic implications. */
double read_double(const char *buffer, int *offset) {
    double value = 0.0;   int conversion_length = 0;
    int result = sscanf(buffer + *offset, "%lf%n", &value, &conversion_length);
    printf("Read %f at %d\n", value, *offset);
/* update the offset into the line buffer */
    *offset += conversion_length; 
    if(result > 0) return value;
    else { *offset = -1;  return 0.0; }
}
int main(void) {
    FILE *fp = fopen("test.dat", "r");   /* "rt" is not a standard mode */
    if(fp != NULL) {
        char buffer[MAX_LEN];      int conversions = 0;
        /* Testing fgets() directly, rather than feof(), also processes a
         * final line that has no newline. */
        while(fgets(buffer, sizeof buffer, fp) != NULL) {
            int offset = 0;     int i;
            conversions = double_fields(buffer);
            printf("Data file contains %d columns\n", conversions);
            for(i = 0; i < conversions; i++) {
                double value = read_double(buffer, &offset);
                if(offset != -1) printf("%f\n", value);
                /* stop here: an offset of -1 must not be used again */
                else { puts("Error reading double"); break; }
            }
        }
        fclose(fp);
    }
    else puts("Unable to open test data file");
    return 0;
}

Answer from Chris Main

If you are using C then you can use the library function strtok() to work out how many columns are present in the first line. It takes two arguments. The first argument (char *) should be the line the first time strtok() is called and NULL on subsequent calls. The second argument should be a string containing the characters separating the values. Call it until it returns NULL to determine how many columns there are in the first line:

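/* Note: strtok() modifies first_line. Tab and newline are included as
 * separators on the assumption that the line was read with fgets(). */
int column_count = 0;
if(strtok(first_line, " \t\n")) {
    ++column_count;
    while(strtok(NULL, " \t\n")) ++column_count;
}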

Using this technique, I don't think there is any way of generalising the call of fscanf() because it takes a variable number of arguments. Instead for each row of the file there needs to be a loop so that fscanf() is called the same number of times as there are columns, reading one value each time:

while(!feof(file)) {
    int i, n;
    for(i = 0; i < column_count; i++) {
        if(fscanf(file, "%d", &n) != 1) break;
    }
    if(i == column_count) {
/* Another row read successfully */
    }
    else if(i > 0)
    {
/* Error condition - incomplete row */
        break;
    }
    else /* i == 0 */ {
/* End of file (or input that cannot be read as a number) */
        break;
    }
}

Unfortunately it is not possible to rely entirely on feof() when doing this. The return value of fscanf() must be checked each time to detect the end of the file as well as an incomplete last row.

A full listing is given in scan.c and, for comparison, a C++ version in scan.C. As will be seen, the C++ version relieves the programmer of a lot of error handling and memory management.

[I have renamed the second of those files scan.cpp. Note that these together with other source code files will go on our web site. FG]

from David Stone <dfstone@lithoi.demon.co.uk>

My first reaction on reading this question is: why does the questioner want to do this? C is a low-level language, and many packages and languages are more suitable for dealing with tables of numbers. What is the larger task to be done with these numbers? The data are statistical, apparently; there are many packages available to do statistical analysis. I know that the free R package (http://cran.r-project.org/) [would anyone like to review that for us? FG] can handle this format; so can many other packages. If the analysis is simple enough, you can probably do it by reading the numbers into a spreadsheet.

If the word 'statistical' is a red herring, and it really is necessary to use a general-purpose programming language, I should still avoid C if possible. Perl can cope with the desired input format fairly easily, certainly more easily than C; it may be suitable, depending on what is to be done to the numbers. The best language in the circumstances depends on what the questioner knows, and what is available to him, as well as the nature of the full problem.

If the questioner insists that C must be used, then the first thing I should say would be to discourage the use of fscanf(). It is inflexible because the conversion of the input is mixed up with reading the input. This makes error handling difficult (for example, coping with a line with too few values). Much better would be to read a line with fgets(), and then start taking apart the line with strtol(). (Incidentally, it is possible that the questioner intended to program in C++; I assume not, because fscanf() is part of the C i/o library.)

The question is not clear as to whether there can be an arbitrary number of columns and lines, or whether they have upper bounds. Especially since the questioner is a novice, I should suggest setting bounds if possible, because there are several tricky points in dealing with dynamically allocated arrays, necessary both for the input line and for the arrays if there are no bounds.

If we assume an upper bound MAX_COLUMNS on the number of columns can be set, then the questioner could use an array of arrays

long iValue[MAX_LINES][MAX_COLUMNS];

and read the data with two nested loops. The outer would loop through the lines, calling fgets() each time, and then using an inner loop which called strtol() until end-of-line, filling in the array iValue[iCurrentLine]. The code would check that no more than MAX_LINES lines were read, and no more than MAX_COLUMNS numbers found on each line. Other necessary checks are that no line was longer than the line buffer passed to fgets(), that each line had the same number of values as the first, that each number fitted into a long, and that each line had no extraneous rubbish. These checks will all fit inside such a structure. They do, however, require the designer to think through how to do error reporting.
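The inner loop, together with several of the checks just listed, might be sketched like this. The function name parse_longs and its convention of returning -1 for any bad line are assumptions; the caller would still have to check the line length and compare the count with that of the first line.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper: converts the longs on one line into values[].
   Returns the number of values found, or -1 if the line holds more
   than max_values numbers, a number that does not fit in a long, or
   any text that is not a number. */
int parse_longs(const char *line, long *values, int max_values)
{
    const char *p = line;
    int count = 0;

    for (;;) {
        char *end;
        long v;

        errno = 0;
        v = strtol(p, &end, 10);
        if (end == p) break;                 /* no further number found */
        if (errno == ERANGE) return -1;      /* does not fit in a long */
        if (count == max_values) return -1;  /* too many columns */
        values[count++] = v;
        p = end;
    }
    /* Only white space (including the newline) may remain. */
    while (*p == ' ' || *p == '\t' || *p == '\n' || *p == '\r') p++;
    return *p == '\0' ? count : -1;
}
```

The outer loop would call fgets() for each line and pass iValue[iCurrentLine] as the values argument, with MAX_COLUMNS as the limit.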

At this point I should hope that the questioner was beginning to think that coding it in C was more complex than he had originally thought, and I should re-introduce the ideas I started with, about using another package or language.

Questions

Q1. (from Simon W. Day <106161.304@compuserve.com>)

Bug that took three evenings to find

Strictly this is not a question, but I would welcome comments on it.

long k,n;
float q,p;
n=4096; 
//or other power of 2 
p=n;
q=log(p)/log(2.0);  //gives 12.00000
k=q;                //gives 11 !!!!

QC and Symantec C gave k=12 as we would hope.

VC++ version 5 gets k=11. The reason seems clear (12.0000 must really be something like 11.99999) but your readers' comments on how to avoid such pitfalls might be of general interest.

Q2. (from Silas Brown) www.flatline.org.uk/~silas

Spot the Bug

I recently found a bug in my code, which I have now fixed but thought it would make an interesting exercise. In fact there are lots of things wrong with it, but one in particular caused a problem. Here is the relevant part of the code. It has been in my program (and I have been using it) for nearly three years; if I were starting again then I would do it rather differently now.

int InString::addLineFromFile(FILE* f) {
    char temp[TEMP_BUFLEN+1];
    if(!fgets(temp,TEMP_BUFLEN,f)) return(EOF);
    do {
        int i=strlen(temp)-1;
        if(temp[i]<' ') {
         while(i>=0 && temp[i]<' ') temp[i--]=0;
            addString(temp); return('\n');
        } else addString(temp);
    } while(fgets(temp,TEMP_BUFLEN,f));
    return('\n');
}
void HttpHeader::readMimeHeader(FILE* fp) {
    InString s;
    while(!feof(fp)) {
        s.clear();
        s.addLineFromFile(fp);
        const char* str=s.getString();
        if(!str[0]) return;
        ...
    }
}

The file that is being read may have been generated on another operating system.

Q3. (from comp.lang.c++.moderated)

Is there a header file I need to include before I can use:

using namespace std;

I cannot figure out which header files (or combination) makes the std namespace available.

This looks like a simple question from a raw novice, but looks can be deceptive. When answering think carefully about any problems you might get if you tried to compile a file with nothing else in it than:

using namespace std;
int main() { return 0; }

Now ask yourself if there is a tidy solution that avoids including any specific file. Finally you might like to comment on why the original question is unlikely to be from a competent programmer with experience in correct use of C++.

Q4. (from Silas Brown) www.flatline.org.uk/~silas

In the expression f(a(),b(),c()), the order of evaluation of the functions a, b and c is not defined. But what about the expression f(g(a()),g(b()),g(c())) - is each call of a, b and c guaranteed to be followed by a call of g, or is the compiler free to call a, b and c first and then g three times on the three results?

This can make a difference if, for example, a, b and c all write to the same static buffer, and the function g copies it into a new area of memory.
