Title: Code Critique Competition 96

Author: Martin Moene

Date: 04 November 2015 09:11:52 +00:00 or Wed, 04 November 2015 09:11:52 +00:00

Summary: Set and collated by Roger Orr. A book prize is awarded for the best entry.

Body:

Participation in this competition is open to all members, whether novice or expert. Readers are also encouraged to comment on published entries, and to supply their own possible code samples for the competition (in any common programming language) to scc@accu.org.

Note: If you would rather not have your critique visible online, please inform me. (We will remove email addresses!)

Last issueâ€™s code

I have some C code that tries to build up a character array using printf calls but Iâ€™m not getting the output I expect. Iâ€™ve extracted a simpler program from my real program to show the problem.

With one compiler I get "Rog" and with another I get "lburp@".

Iâ€™m expecting to see:

"Roger: 10

Bill: 5

Wilbur: 12"

What have I done wrong?

Can you give some advice to help this programmer?

The code is in Listing 1.

#include <stdio.h>
#define ARRAY_SZ(x) sizeof(x)/sizeof(x[0])
typedef struct _Score
{
  char *name;
  int score;
} Score;

void to_string(Score *scores, size_t n,
  char *buffer, size_t len)
{
  for (size_t i = 0; i < n; i++)
  {
    size_t printed = snprintf(buffer, len,
      "%s:\t%u\n",
      scores[i].name, scores[i].score);
    buffer += printed;
    len -= printed;
  }
}
void process(char buffer[])
{
  Score sc[] = {
    { "Roger", 10 },
    { "Bill", 5 },
    { "Wilbur", 12 },
  };
  to_string(sc, ARRAY_SZ(sc),
    buffer, ARRAY_SZ(buffer));
}
int main()
{
  char buffer[100];
  process(buffer);
  printf(buffer);
}

Listing 1

Critiques

Mathias Gaunard

The main problem of this snippet is the ARRAY_SZ macro, meant to compute the size of an array. This macro will accept pointers as input but provide the wrong answer, in this case sizeof(char*)/sizeof(char), which is the word size, 4 bytes for 32-bit systems. This explains why the result is "Rog" on some systems; only 4 bytes were written, 3 characters plus the null byte.

With C++, it is possible to write a function that provides the same functionality but that will lead to errors whenever pointers are passed, by passing the array by reference and using templates to deduce its size:

  template<class T, size_t N>
  size_t array_sz(T(&)[N]) { return N; }

By using this function instead of ARRAY_SZ, we can easily isolate the errors. One way to fix this is to modify process to also take the array by reference.

  template<size_t N>
  void process(char (&buffer)[N]);

This will make the code work and display the expected result, however, it still has an issue with how it handles the case where the buffer is not large enough to hold all score listings. As we saw earlier, the code incorrectly claimed that the buffer was word-size-sized, but it should still have always given "Rog" as a result, and never "lburp@". snprintf can return a negative value on failure, and returns the number of characters it would have written if the buffer is not big enough. It is easy to trigger erroneous output like "lbur" simply by hardcoding printed to (size_t)-1.

It is therefore necessary to do

  printed = min(printed, len)

to avoid writing past the end or before the start of the buffer. Alternatively, one could use dynamically-sized buffers to not impose arbitrary limits, which can be done easily by using C++ iostreams.

Other miscellaneous issues:

_Score is a name reserved for the implementation, and should not be used. All names starting with an underscore followed by an uppercase letters are reserved. By using C++, the typedef/struct becomes redundant anyway, so one should just use struct Score.
The type of the name member in Score should be const char* rather than char*, since it is initialized from string literals, which are const in C++, and shouldnâ€™t be modified regardless since it is undefined behaviour to do so.
The first argument to printf should be a format, it is therefore potentially dangerous to call printf(buffer) directly in case we ever chose to put special characters in the score entries. Instead, one should write printf("%s", buffer);.

Peter Sommerlad

First of all, if a C program misbehaves, I suggest compiling it with a C++ compiler.

Sidenote: Well, that is what I would do and also I would highly recommend to no longer use C at all. With the exception of very, very small targets in the embedded area (8 bit controllers), decent C++ implementations should be available and C should no longer be taught to students. It just lacks any means of abstraction and is too hard to use correctly and even if it is, it is too easy to break things by â€˜maintenanceâ€™.

Now to the problem at hand. Our IDE Cevelop and my C++ compiler tells me several problematic points, including the culprit of the underlying issue.

Let us start our journey in main():

  int main()
  {
    char buffer[100];

The array line is something I wouldnâ€™t have done in C++ at all and I was recommending of not allocating arrays on the stack in C at all, if you can not control the size written. Code like that is behind most security vulnerabilities in the past decades. Cevelop (set to parse as C++14) will claim

Found C-Array: {0}
Un- or ill-initialized variable found

The first note is that we try to rewrite code using plain arrays to use the better and copyable alternative std::array instead (or std::string). The second comes from our C++11 checker that looks for variables not initialized with the initializer list syntax and transforms them. Even in C code this might be a good idea, since the memory of buffer can contain arbitrary values (only by luck some bytes are 0 if we use C-style string handling in the general case).

Before we apply the quick fixes let us go further to the code of main().

  process(buffer);

Here an array is passed as a function argument and our student should know that any array passed as such degenerates to a pointer to the first element and the dimension of the array is lost. This is still true in C++ and therefore I recommend to never use a plain array as a function parameter or as an argument. It just works without any safeguards and having a programmer deal with array bounds checking explicitly is very error prone and again leads to another big share of software vulnerabilities.

  printf(buffer);

Even the innocent looking last line is problematic. The printf function takes at least one char pointer argument, but interprets its content. Since we will not know what content our process function is storing in the buffer array, we do not know if printf would be expecting more arguments, because buffer contains any special printf formatting symbols, such as "%s", for example. This could lead to leaking unintended information or in the better case to program crashes, because of illegal memory addresses accessed. It is a similar problem as with the XKCD comic (https://xkcd.com/327/) with printf syntax instead of SQL.

A better C function to output a plain C string would be

  puts(buffer);

instead, which at least just expects buffer to be NUL terminated.

Wow, a full screen full of text with just 3 lines of code. Roger selected a good example this time!

Now let us look one step down in the call chain to process:

  void process(char buffer[])

as said above, this array parameter is equivalent to writing it as a pointer parameter

  void process(char *buffer)

and that will make the real error below (stay tuned) much more evident.

{
  Score sc[] = {
    { "Roger", 10 },
    { "Bill", 5 },
    { "Wilbur", 12 },
  };

The lines above again triggers two messages, one from our plug-in and one from the C++ compiler:

Found C-Array: {0}
ISO C++ forbids converting a string constant to 'char*'

The first one tells us again that plain arrays should be verboten. We can automatically change that to use std::array<Score,3> instead, but that would be C++ already. The second warning tells us something about the differences between C and C++, where C++ makes string literal const char arrays, whereas classic C made those char pointers (even without const, which wasn't invented then). So in theory, one could overwrite the literal values for the names given here. This also tells something about the type Score that keeps plain char pointers and no size information. In production code, this is a no-no, because you can not make any safe memory management around such a string representation.

And now to the definitive culprit.

    to_string(sc, ARRAY_SZ(sc),
     buffer, ARRAY_SZ(buffer));
  }

ARRAY_SZ(buffer) expands to

  sizeof(buffer)/sizeof(buffer[0])

and this generates the compiler warning:

<ArticleBodyIndent> 'sizeof' on array function parameter 'buffer' will return size of 'char*' [-Wsizeof-array-argument]</ArticleBodyIndent>

which we can compute on a 64 bit computer to become 8 instead of the 100 our student expected and the equivalent of the array with the pointer makes it obvious. to_string is by the way a bad name for new code, because it is now a function overload in C++â€™s std namespace and could make problems, when this code is naively ported to C++ with a using namespace std in scope (BTW, Cevelop provides a refactoring away from such practice). Cevelop also suggest to change the macro for ARRAY to a C++ inline (template) function and can do so. In addition this will avoid the potential bug, since the macro is not having parenthesis around its expansion and not around the second use of the parameter. This can cause parsing havoc if used with non-literals as arguments.

So a more correct version of

  #define ARRAY_SZ(x) sizeof(x)/sizeof(x[0])
would be
  #define ARRAY_SZ(x) (sizeof(x)/sizeof((x)[0]))
but even better in C++11:
  template<typename T1>
  inline constexpr auto ARRAY_SZ(T1&& x) ->
    decltype(sizeof (x) / sizeof (x[0]))
  {
    return (sizeof (x) / sizeof (x[0]));
  }

But again that is to no avail in the correct working of process. There are several problems to be resolved with its design. First, passing the buffer from the outside, requires cautions use of its memory resource, so we need to pass the size of the available space. But there is still a problem, we do not know, if the size was sufficient or not. In such a case, it might be better to return an indication if the size was sufficient. This could be a bool, denoting success (not very helpful), the size actually used and an error code, if insufficient (i.e., -1) or best, the size required for the output, if it was too small, which could be a burden to implement and to use. In any case, passing in memory (buffer) to be used as function output is a hassle. In C++ I would recommend to use an ostringstream which will automatically expand an underlying string object and return that by value. This is one area where C is really a burden vs. C++.

It is also unrealistic, that the Score array sc is local to the function, so this must be passed as well, again explicitly passing its size, meaning we need define 4 parameters and pass 4 interdependent arguments. Using a std::vector<Score> in C++ would simplify that again.

Now to the last function in the code where the student actually shows, that she/he understands a bit about passing arrays as pointers with an explicit size. But again, such an API (inherent in C) is easy to use wrongly (as we have seen) by passing inconsistent sizes. Also a return type of void signals side effects, the code can have quite unintended ones.

  void to_string(Score *scores, size_t n,
    char *buffer, size_t len)
  {
    for (size_t i = 0; i < n; i++)
    {
      size_t printed = snprintf(buffer, len,
        "%s:\t%u\n",
        scores[i].name, scores[i].score);
      buffer += printed;
      len -= printed;
    }
  }

Again, here C gets in the way of safe usage of memory. First snprintfâ€™s return type is int, and there are obscure cases, where it might return a negative number. Second, if the space allowed is too small, it still returns the number of characters that would be needed, so the code will miscalculate the value of len, which might underflow. While underflow and overflow does not trigger undefined behaviour with unsigned types, such as size_t, the resulting value is still well beyond the capacity of the buffer and thus can easily lead to overwriting memory far away from the allotted memory, resulting in another severe security issue or potential crashes from overwriting return addresses.

A quick fix, would be to at least check if printed is smaller than len and only then proceed and otherwise return from the function.

  if(printed >= len) return;

I think I covered most of the problems by now. So how would a refactored C++ version look?

Here is one, that uses the recommendation I put above and getting rid of most of the bad C habits. Its overall design is still not what I consider splendid, but without more context it would be hard to judge if Score should have a constructor, be const or have member functions/related functions, such as an overloaded output operator<<.

  #include <cstdio>
  #include <sstream>
  #include <vector>
  struct Score
  {
    std::string name;
    int score;
  };
  using ScoreList=std::vector<Score>;
  std::string to_string(ScoreList const &scores)
  {
    std::ostringstream out{};
    for (auto const &score:scores)
    {
      out << score.name << '\t'
          << score.score << '\n';
    }
    return out.str();
  }
  std::string process()
  {
    ScoreList sc {
      { "Roger", 10 },
      { "Bill", 5 },
      { "Wilbur", 12 },
    };
    return to_string(sc);
  }
  int main()
  {
    std::puts(process().c_str());
  }

Hope that helps some programmers to become better programmers and write less code.

Gareth Ansell <gareth.ansell@sky.com>

While trying to solve this puzzle I was initially tempted to dive straight for the compiler and start hacking away at the code. However, I managed to resist this urge and engaged brain instead. After which the solution was quite easy to spot. I then modified the code to test my hypothesis, which was proved correct.

The initial problem is in the process() function, where the buffer array is passed as an argument. Since this is decomposed to a pointer, it is not possible for the subsequent call to the to_string() function to determine the size of buffer[] by using sizeof (in the ARRAY_SZ macro).

In C the solution to this is to add a length parameter to the process() function, and use this in the call to to_string(). In C++, a templated solution could be used.

Apart from this there are a few minor niggles:

In _Score score is an int, but in the call to snprintf it is referenced as an unsigned.
The last element of the sc[] array in process[] has a trailing comma
The for loop in to_string() compares a signed to unsigned int.

My working solution is shown below:

  #include <stdio.h>
  #define ARRAY_SZ(x) sizeof(x)/sizeof(x[0])
  typedef struct _Score
  {
    char *name;
    unsigned int score;
  } Score;
  void to_string(Score *scores, size_t n,
    char *buffer, size_t len)
    {
      for(unsigned int i = 0; i < n; i++)
      {
        size_t printed = snprintf(buffer, len,
          "%s:\t%u\n",
          scores[i].name, scores[i].score);
        buffer += printed;
        len -= printed;
      }
    }
  void process(char buffer[], size_t len)
  {
    Score sc[] = {
      { "Roger", 10 },
      { "Bill", 5 },
      { "Wilbur", 12 }
    };
    to_string(sc, ARRAY_SZ(sc), buffer, len);
  }

Matthew Wilson

This one is rather simple, I think: itâ€™s an issue of array->pointer decay.

The ARRAY_SZ() macro follows a familiar pattern in many codebases, used to fill the obvious missing dimensionof() operator that should exist in C (and C++). It works by creating a compile-time constant representing the number of elements in an array by dividing the total size of the array by the size of an element. Works fine for arrays.

The problem is, it also works for pointers and, in C++, for types defining the subscript operator. And in such cases the sizes obtained is almost never correct (and sometimes not compile-time constant). This is all discussed at length in chapter 14 â€“ Arrays and Pointers â€“ of my book Imperfect C++ [1], so I wonâ€™t belabour the point here.

Also discussed in the same chapter of IC++ is the phenomenon of array-pointer decay. Briefly, it allows an array to be used in circumstances where a pointer is required. Also, a function declaration involving an array is interpreted, in a similar vein, as a pointer. Hence, the declaration of process() as

  void process(char buffer[])

is exactly equivalent to the declaration

  void process(char* buffer)

Expressed in this form, the problem is all too easy to see. The size of ARRAY_SZ(buffer) is going to be 4 (32-bit) or 8 (64-bit), and certainly not the 100 of the actual buffer buffer declared in main().

The obvious fix is to change process() to have the signature

  void process(char* buffer, size_t len)

which is hinted at strongly in the earlier defined to_string().

When so amended â€“ along with some const-correction, VC++ compatibility, and use of STLSoftâ€™s STLSOFT_NUM_ELEMENTS() (which makes application of ARRAY_SZ() to a pointer a compile-time, rather than runtime, error) â€“ the code looks like:

  #include <stlsoft/stlsoft.h>
  #include <stdio.h>

  #if defined(_MSC_VER)
  # define snprintf _snprintf
  #endif
  #define ARRAY_SZ(x) STLSOFT_NUM_ELEMENTS(x)
  typedef struct _Score
  {
    char const* name;
    int         score;
  } Score;
  void to_string(Score const* scores, size_t n,
                 char *buffer, size_t len)
  {
    for (size_t i = 0; i < n; i++)
    {
      size_t printed = snprintf(buffer, len,
        "%s:\t%u\n",
        scores[i].name, scores[i].score);
      buffer += printed;
      len -= printed;
    }
  }
  void process(char* buffer, size_t len)
  {
    Score const sc[] = {
      { "Roger", 10 },
      { "Bill", 5 },
      { "Wilbur", 12 },
    };
    to_string(sc, ARRAY_SZ(sc),
      buffer, len);
  }
  int main()
  {
    char buffer[100];
    process(buffer, ARRAY_SZ(buffer));
    printf(buffer);
  }

Reference

[1] Imperfect C++, Matthew Wilson, Addison-Wesley, 2004.

James Holland <James.Holland@babcockinternational.com>

My compiler did not provide any helpful hints as the studentâ€™s program compiles without any errors or warnings. When I ran the program, "Rog" was displayed as the student observed. After a little investigation, I find that the problem is in the parameter of process. I suspect than the student is under the impression that an array is being passed to process, after all the type of the parameter, char buffer[], looks like an array declaration. Unfortunately, despite its looks, the parameter type is equivalent to char * buffer. When the latter declaration is used, it becomes clear that a pointer to buffer is being passed to process and not buffer itself. From within process, it is the size of the pointer that is passed to to_string and not the size of buffer. On my machine, pointers are 4 bytes in length and so the value 4 is being passed to to_string. This explains why the program outputs "Rog". The function to_string, and ultimately snprintf, thinks that buffer is only 4 characters long, enough for three characters and the null terminator.

Unfortunately, it is not possible to pass to a function an array by value. Only a pointer to an array can be passed. This presents no great difficulty, however. If in addition to passing a pointer to the array, the length of the array is passed, process will have all the information it needs about buffer. Within process, the length of the output buffer can be passed straight to to_string as shown below.

  void process(char * buffer, size_t length)
  {
    Score sc[] = {{"Roger", 10}, {"Bill", 5},
      {"Wilbur", 12},};
    to_string(sc, ARRAY_SZ(sc), buffer, length);
  }

When this modification is made, things look a lot better and the program produces the desired result. There are, however, some unresolved problems than could manifest themselves in the studentâ€™s real program.

It is assumed that the student is using snprintf (as opposed to sprintf) because he/she does not want data to be written beyond the limits of buffer. This is a laudable desire but unfortunately the code, as it stands, does not provide that protection in general. There is still a potential problem when the size of the data to be written exceeds the size of buffer. This is demonstrated more conveniently if the size of buffer is reduced instead of increasing the size of the data.

Suppose, for example, that instead of buffer being declared 100 characters long, it had been declared 17 characters long. The program would correctly write to buffer the string representing the data for Roger. snprintf returns the length of the string (not including the null terminator) that it attempts to write, in this case, 10 characters. This value is assigned to the variable printed. This value is then subtracted from len (that currently has a value of 17) leaving 7. The program then attempts to write the data for Bill. This string is 8 characters long. The program (or more specifically snprintf) writes as much of the data to buffer as it can without overflowing buffer. Although the output string has been truncated, nothing disastrous (as far as program execution is concerned) has occurred. Next, the length of string for Bill is subtracted from len. The value of len is currently 7 and the length of the string for Bill is 8 characters. The subtraction to be performed is, therefore, 7 minus 8. This is where problems start. The variables printed and len are both of type size_t (an unsigned type). The result of the subtraction is not -1, as expected, but a very large positive number. On my machine the value of len at this point is 4294967295. The program then goes around the loop once more to write the data for Wilbur and, thinking there is plenty of space in buffer (because len is a large number), writes Wilburâ€™s string beyond the end of buffer, probably with disastrous consequences.

This behaviour can easily be prevented by inserting the following code just after the snprintf statement. It will prevent the program going around the loop again when buffer is full and will also print a message explaining the problem.

  if (printed >= len)
  {
    printf("Buffer exhausted. Results may be "
      "incorrect.\n");
    break;
  }

While on the subject of buffer length, there is another problem that occurs if buffer is declared as having zero size. I admit this is unlikely to occur in an otherwise fully debugged program but there is, at least, a theoretical point to be made here. An array of zero bytes will have an address, so it can be referenced. Such an array will, in all probability and not unreasonably, occupy zero bytes of memory. Given this, printf (as used in the last statement of main) will start printing at a memory location that has nothing to do with buffer. It could print anything of any length, depending on what data it stumbles across. This behaviour must be prevented if there is any possibility of buffer being of zero length. A simple if statement around printf would suffice. I leave the details to the student.

There is another problem with the printf statement; it is being used in a potentially unsafe way. printf is declared (in stdio.h) as having one or more parameters. The first parameter is a format control string. The studentâ€™s printf statement has one parameter only and so the parameter must be the control string. It can be seen from an inspection of the printf statement that the control string is the contents of buffer. The contents of buffer came originally from the array sc. If one of the names in sc were to be changed, say, from "Roger" to "%sRoger", the program will probably crash in spectacular fashion. This is because printfâ€™s format control string now says there is a string parameter to be printed despite there not actually being one. The result is undefined behaviour. To prevent this, a literal string should be supplied as the format control string and the string to be printed (in this case buffer) supplied as the second parameter as shown below.

  printf("%s", buffer);

printf will now simply print the contents of buffer without interpreting any characters sequence as format control characters, as required.

Although it does not show itself in the studentâ€™s test program, there is another potential problem lurking in the wings. It is the definition of the macro ARRAY_SZ. Suppose the student decides to make use of this macro somewhere else in his or her code. The student may write something like the following statement, for whatever reason.

  size_t x = 25 % ARRAY_SZ(sc);

Let us assume that sc is the same array as defined in the studentâ€™s test program. ARRAY_SZ(sc) should, therefore, produce the value 3. The expression becomes 25 % 3 which is equal to 1. The variable x should, therefore, be initialised with the value of 1. In fact x is assigned a value of 0, contrary to all expectation. How can this be? All becomes clear if the macro is expanded to produce the expression seen by the compiler, as illustrated below.

  size_t x = 25 % sizeof(sc) / sizeof(sc[0]);

The expression will be evaluated from left to right. So the first term that is evaluated is 25 % sizeof(sc) or 25 % 24 which equals 1. Next, the term sizeof(sc[0]) is evaluated, this equals 8. Finally, 8 is divided into 1 giving zero. It can now be seen that brackets are required around sizeof(sc) / sizeof(sc[0]) so that this part of the expression is evaluated first. This is achieved by enclosing the definition of ARRAY_SZ(x) in parentheses.

  #define ARRAY_SZ(x) (sizeof(x)/sizeof(x[0]))

In fact it is good practice to enclose the definition of all but the simplest macros in parentheses to prevent this type of error.

There are a couple of inconsistencies in the program. Score::score is of type int and yet it is being printed by snprintf as if it were unsigned. Assuming Score::score is meant to be signed, then snprintfâ€™s format control string should be "%s:\t%d\n".

Also, quoted text strings are considered constant and yet Score::name is declared as a pointer that can modify what it is pointing to. The declaration const char * name provides the remedy. This will prevent accidentally attempting to write to a constant string as a compile-time error will result.

I noticed that the student provided a tag name when defining the Scores structure by use of a typedef statement. A structure need only have a tag name when the structure makes reference to pointers of the same type. This is not the case in the studentâ€™s program and so the tag name is not required. Incidentally, it looks as if the student has made an attempt to prevent the tag name from clashing with the type name by use of a leading underscore character. This is not necessary as the two names are in different â€˜name spacesâ€™. The tag and the type could both be named Score.

There is a redundant comma at the end of SCâ€™s initialiser list. The compiler is perfectly happy with this and it does not affect the meaning of the program. It is allowed, at least in part, to make the job of automatic code generators a little easier. However, as it is not required, I prefer not to see a trailing comma in hand-written code.

Finally, it might be constructive to discuss the choice of variable names. To some degree this is a matter of personal style but there are some guidelines that should be observed. The names of variables (and other identifiers) should reflect their meaning and should not be excessively abbreviated. I would prefer to see length_of_buffer rather than len and expected_print_length rather than printed, for example. Selecting suitable names can be quite tricky and I am not suggesting my examples are ideal but I do think they help in understanding how the program works.

I have made quite a few corrections and suggestions in reviewing the studentâ€™s code. This should not be seen as a damning criticism designed to dishearten the student. On the contrary, it is designed to encourage the student to complete his or her project in particular and to continue to learn about the fascinating topic of computer programming and software development in general.

Commentary

With five pretty comprehensive entries Iâ€™m not sure thereâ€™s much for me to add. About the only thing no-one remarked on was the inconsistent brace positioning!

It seems a shame that snprintf() doesnâ€™t provide a foolproof safe replacement for sprintf() â€“ examples like this code show how easy it is, given the variety of return values and the dangers of implicit conversion between signed and unsigned integer values, to use snprintf() in an unsafe way.

One additional point that might be worth making is that Visual Studio 2015 does provide snprintf() in <stdio.h> (although I havenâ€™t yet found where this is documented on MSDN.) This function seems to follow the behaviour required by the C11 standard. However, the Microsoft specific function _snprintf() is subtly different; in particular it does not ensure the target buffer is null terminated. This is an additional source of potentially dangerous confusion around this function call. Even beyond that, the implementation of snprintf() in the mingw implementation of g++ 4.9.2 on Windows also fails to ensure the buffer is null terminated.

The moral of this critique is to be extremely careful if you use snprintf() in C code â€“ or in C++.

The Winner of CC 95

There were five good critiques and I think each one would have helped the person with the original problem to understand what was wrong and to fix it. However, I think Peterâ€™s critique covered the most ground and I have awarded him the prize for this issueâ€™s critique.

Code Critique 96

(Submissions to scc@accu.org by Oct 1st)

Thanks are due to Hubert Matthews for the idea that inspired this critique.

I have written some code to read in a CSV file and handle quoted strings but I seem to get an extra row read at the end, not sure why.

If I make a file consisting of one line:

--- Book1.csv ---

word,a simple phrase,"we can, if we want, embed commas"

--- ends ---

I get this output from processing the file:

    Rows: 2
    Cells: 3
      word
      a simple phrase
      "we can, if we want, embed commas"
    Cells: 1

What have I done wrong?

The code is in Listing 2.

// Reading CSV with quoted strings.
#include <iostream>
#include <string>
#include <vector>

typedef std::string cell;
typedef std::vector<cell> row;
typedef std::vector<row> table;

table readTable()
{
  char ch;
  table table; // the table
  row *row = 0; // the row
  cell *cell = 0; // the cell
  char quoting = '\0';
  while (!std::cin.eof())
  {
    char ch = std::cin.get();
    switch (ch)
    {
    case '\n':
    case ',':
      if (!quoting) {
        cell = 0;
        if (ch == '\n') {
          row = 0;
        }
        break;
    case '\'':
    case '"':
        if (quoting == ch) {
          quoting = '\0';
        }
        else if (!cell) {
          quoting = ch;
        }
      }
    default:
      if (!row) {
        table.push_back({});
        row = &table.back();
      }
      if (!cell) {
        row->push_back({});
        cell = &row->back();
      }
      cell->push_back(ch);
      break;
    }
  }
  return table;
}

int main()
{
  table t = readTable();

  std::cout << "Rows: " << t.size() << "\n";
  for (int r = 0; r != t.size(); ++r) {
    std::cout << "Cells: "
      << t[r].size() << "\n";
    for (int c = 0; c != t[r].size(); ++c)
    {
      std::cout << "  " << t[r][c] << "\n";
    }
  }
}

Listing 2

You can also get the current problem from the accu-general mail list (next entry is posted around the last issueâ€™s deadline) or from the ACCU website (http://www.accu.org/journals/). This particularly helps overseas members who typically get the magazine much later than members in the UK and Europe.

Notes:

More fields may be available via dynamicdata ..

Journal Articles