Title: Code Critique Competition 118

Author: Bob Schmidt

Date: 05 July 2019 00:55:29 +01:00 or Fri, 05 July 2019 00:55:29 +01:00

Summary: Set and collated by Roger Orr. A book prize is awarded for the best entry.

Body:

Please note that participation in this competition is open to all members, whether novice or expert. Readers are also encouraged to comment on published entries, and to supply their own possible code samples for the competition (in any common programming language) to scc@accu.org.

Note: if you would rather not have your critique visible online please inform me. (Email addresses are not publicly visible.)

Last issueâ€™s code

Iâ€™m trying to do some simple statistics for some experimental data but I seem to have got an error: I donâ€™t quite get the results I am expecting for the standard deviation. I canâ€™t see whatâ€™s wrong: Iâ€™ve checked and I donâ€™t get any compiler warnings. I ran a test with a simple data set as follows:

     echo 1 2 3 4 5 6 7 8 9 | statistics
     mean: 5, sd: 1.73205

The mean is right but I think I should be getting a standard deviation of something like 2.16 for the set (without outliers this is [2,8].)

Can you solve the problem â€“ and then give some further advice?

Listing 1 contains statistics.h and Listing 2 is statistics.cpp.

namespace statistics
{
  // get rid of the biggest and smallest
  template <typename T>
  void remove_outliers(T& v)
  {
    using namespace std;
    auto min = min_element(v.begin(), v.end());
    auto max = max_element(v.begin(), v.end());
    v.erase(remove_if(v.begin(), v.end(),
      [min, max](auto v) {
        return v == *min || v == *max; 
      }),
      v.end());
  }
  template <typename T>
  auto get_sums(T v)
  {
    typename T::value_type sum{}, sumsq{};
    for (auto i : v)
    {
      sum += i;
      sumsq += i * i;
    }
    return std::make_pair(sum, sumsq);
  }
  template <typename T>
  auto calc(T v)
  {
    remove_outliers(v);
    auto sums = get_sums(v);
    auto n = v.size();
    double mean = sums.first / n;
    double sd = sqrt((sums.second -
      sums.first * sums.first / n) / n - 1);
    return std::make_pair(mean, sd);
  }
}

Listing 1

#include <algorithm>
#include <cmath>
#include <iostream>
#include <iterator>
#include <vector>
#include "statistics.h"
void read(std::vector<int>& v)
{
  using iter = std::istream_iterator<int>;
  std::copy(iter(std::cin), iter(),
    std::back_inserter(v));
}
int main()
{
  std::vector<int> v;
  read(v);
  auto result = statistics::calc(v);
  std::cout << "mean: " << result.first
    << ", sd: " << result.second << '\n';
}

Listing 2

Critique

Dave Simmonds

remove_outliers will run down the container 3 times, which could be costly on a large dataset â€“ weâ€™ll come back to this later.

remove_outliers actually removes the outliers from its input. That sounds OK, and looking at the way calc takes a copy, it works here. But we could be more efficient by taking a const ref and not changing the input. More later.

The lambda in remove_outliers reuses the variable v. This is OK but confusing. We could call it e for element.

In calc, the formula for the standard deviation is the version which means we only have to run down the container once. But as mentioned above weâ€™ve already gone down it three times in remove_outliers, so this could be better. We actually could run down the container only once, keeping track of and counting outliers and adding up sum and sumsq, then removing the amount for the outliers at the end. Code below.

Now for the issue that gives us the wrong result: the formula is trying to divide by n-1, but actually is dividing by n and then subtracting 1. So we need to change n â€“ 1 to (n â€“ 1)

And now another issue. All our calculations here are based on integers, we will get rounding down to nearest integer â€“ which will give us the answer 2 in the example. We need to convert to double before we do any division.

The line(s) in question becomes:

  double sd = sqrt((sums.second -
    (double)(sums.first * sums.first) / n)
    / (n - 1));

This also applies to the mean calculation, which should be:

  double mean = (double)sum / n;

The next issue is what if n is less than 2 â€“ we will divide by zero â€“ so there needs to be a check for that.

Hereâ€™s my version of statistics.h, which addresses all these issues:

  namespace statistics
  {
    template <typename T>
    auto calc(T const & v)
    {
      typename T::value_type sum{}, sumsq{};
      typename T::value_type min{}, max{};
      int min_count{}, max_count{};
      // run down the container just once,
      // collecting what we need
      for (auto i : v)
      {
        if (!min_count || i < min)
        {
          min = i;
          min_count = 1;
        }
        else if (i == min)
        {
          ++min_count;
        }
        if (!max_count || i > max)
        {
          max = i;
          max_count = 1;
        }
        else if (i == max)
        {
          ++max_count;
        }
        sum += i;
        sumsq += i * i;
      }
      auto n = v.size();
      // now we remove the outliers
      if (min_count)
      {
        sum -= min*min_count;
        sumsq -= min*min*min_count;
        n -= min_count;
      }
      if (max_count && max != min) // check if
      // max == min, if so we have already
      // removed it above
      {
        sum -= max*max_count;
        sumsq -= max*max*max_count;
        n -= max_count;
      }
      // check we have enough input
      if (n < 2)
      {
        throw(std::runtime_error(
            "not enough data"));
      }
      // and do the final calcs
      double mean = (double)sum / n;
      double sd = sqrt((sumsq -
        (double)(sum * sum) / n) / (n - 1));
      return std::make_pair(mean, sd);
    }
  }

James Holland

As the student mentions, the compiler issues no warnings. This suggests that there is something wrong with the program logic. It turns out that the line of code that calculates the standard deviation, sd, is in error. Replacing the statement with the following code results in the correct answer for the input provided.

  double variance = 0.0;
  for (auto x : v)
  {
    variance += std::pow(x - mean, 2);
  }
  variance /= n - 1;
  double sd = sqrt(variance);

Unfortunately, the program is still not correct as demonstrated by entering the test data again but omitting the first number, (1). Removing the outliers leaves [3 â€¦ 8] which has a mean of 5.5 and a standard deviation of 1.87083. The program gives a mean of 5 and a standard deviation of 1.94936. Something is clearly still wrong. It turns out that the mean is being calculated incorrectly. It is, in effect, being rounded down to the nearest integer. This is because the sum, an integer, is being divided by the number of samples, another integer. The result is also an integer. The integer is then converted to a double and used to initialise mean. But it is too late; the fractional part of the mean has already been lost. The reason why the correct answer was produced with the original test data is that the mean just happened to be an integer value. To correct the problem I suggest casting sums.first to a double before dividing by the number of samples. The program should now work correctly with any input.

The general advice one receives is not to hand-craft loops oneself but to make use of algorithms already provided by the C++ standard library. The original code already uses some library function, such as std::copy(), but letâ€™s see what else the library has to offer.

Instead of using the two library functions min_elenent() and max_element(), the single function minmax_element() can be used with a slight gain in efficiency. This function returns a std::pair of iterators, the first pointing to an element with the smallest value and the second pointing to an element with the largest value. The std::pair of iterators can then be captured by the lambda of the remove_if() function and used by selecting the first and second elements of the pair.

It is possible to get away without actually erasing the â€˜removedâ€™ elements of the vector. We could modify remove_outliers() to return the logical end of the vector as returned by remove_if(). The logical end of the vector can be used in further processing instead of the actual end of the vector. The small amount of execution time gained may be beneficial in some circumstances.

We now turn are attention to get_sums(). This function contains a single loop that calculates the sum and the sum of the squares. While there is nothing disastrously wrong with the code, the loop could be replaced by two library functions as shown in the listing below. Whether using the library makes the code more readable is debatable, however. At least the number of lines of code in the get_sums() function has been reduced from seven to three.

It is also possible to make use of the standard library in calculating the standard deviation, sd. Instead of employing the code shown at the beginning of this text, the standard library function accumulate() can be used to calculate the square of the standard deviation. This value is called the variance. Obtaining the square root of the variance, therefore, gives the standard deviation as shown in the complete listing of my version of the statistics.h.

  #include <algorithm>
  #include <iostream>
  #include <numeric>
  namespace statistics
  {
    // Get rid of the biggest and smallest.
    template <typename T>
    auto remove_outliers(T & v)
    {
      const auto min_max =
        std::minmax_element(v.begin(), v.end());
      return std::remove_if(v.begin(), v.end(),
       [min_max](auto element) {
       return element == *min_max.first ||
         element == *min_max.second;});
    }
    template <typename T>
    auto get_sums(const T & v,
      typename T::const_iterator logical_end)
    {
      const auto sum =
        std::accumulate(v.begin(), logical_end,
          typename T::value_type{});
      const auto sumsq =
        std::inner_product(v.begin(),
        logical_end, v.begin(),
        typename T::value_type{});
      return std::make_pair(sum, sumsq);
    }
    template <typename T>
    auto calc(T & v)
    {
      const auto logical_end =
        remove_outliers(v);
      const auto sums =
        get_sums(v, logical_end);
      const auto n =
        distance(v.begin(), logical_end);
      const auto mean = 
        static_cast<double>(sums.first) / n;
      const auto variance =
        std::accumulate(v.begin(), logical_end,
        0.0, [& mean](auto a, auto b){
          return a + std::pow(b - mean, 2);})
          / (n - 1);
      const auto sd = sqrt(variance);
      return std::make_pair(mean, sd);
    }
  }

Incidentally, when I was experimenting with various ways of rewriting remove_outliers(), I came up with solutions such as that shown below.

  std::vector<int> v;
  const auto [min, max] =
    std::minmax_element(v.begin(), v.end()); 
  std::remove_if(v.begin(), v.end(),
    [min, max](auto element) {
      return element == *min ||
        element == *max;});

This code would not compile with Clang although the following would.

  std::vector<int> v;
  const auto [minimum, maximum] =
    std::minmax_element(v.begin(), v.end());
  const auto min = minimum;
  const auto max = maximum;
  std::remove_if(v.begin(), v.end(),
    [min, max](auto element) {
      return element == *min ||
        element == *max;});

Both versions compiled without complaint with GCC and Microsoft. Does this represent a bug in the Clang compiler?

Editor: this was a troublesome issue where compilers were found to disagree in their interpretation of the wording in C++17. Initially p0588r1 in Nov 2017 made it illegal, pending further discussion; p1091r3 in Feb 2019 removed this restriction. However, these both only apply to the working paper until a new C++ standard is published.

Hans Vredeveld

The first thing we notice is that statistics.h misses an include guard and that it makes use of several standard library functions, but that it does not have any #includes. It should include <algorithm>, <utility> and <cmath>, so that it is self contained and can be used easily in other implementations than statistics.cpp. The include of <cmath> can then be removed from statistics.cpp. (<algorithm> still needs to be included because of the use of std::copy.) Note that an include of <utility> was missing altogether. In the implementations of the standard library that I have installed (GCC 7.4.1â€™s libstdc++, and LLVM 5.0.1â€™s libc++) <algorithm> includes <utility>, and I guess that in other implementations it will be similar. Still, it looked more like an implementation detail to me than anything else, so we should not rely on it. Putting the includes for the projectâ€™s own headers before the includes for third party and standard library headers would have caught at least part of these issues.

In the three function templates in the header there are two recurring issues. First, naming of identifiers. The template parameter in all three templates is T. This doesnâ€™t convey any information about what it represents. When reading the code, it becomes clear that it stands for a container type, so I suggest naming it Container instead. Similar, I suggest renaming the function parameter v to container in all three function templates. In remove_outliers, the lambdaâ€™s parameter v is different from, and shadows the function parameter v. In this case, v stands for the value of a container element, and I suggest renaming it to value.

The other recurring issue is that of taking things by value instead of by reference. The lambda in remove_outliers and the for-loop in get_sums both operate on elements of the container. In the generic case of templates, where we donâ€™t know anything about the element size, it is better to use const references to the elements. If the element size is small, this has no, or a small impact on the generated code (depending on the compiler and compiler options). If the element size is large, this avoids some expensive copying.

get_sums and calc take a container as argument. In all but the most trivial cases this will be a large object that is expensive to copy. get_sums could easily take its argument as const&. For calc, it is a bit more complicated. calc calls remove_outliers, which operates on the container in place. We could have calc take its argument as non-const reference, but then calc would have some unexpected side effects, where calling calc twice with the same container would give different results. We can improve on this by changing remove_outliers so that it doesnâ€™t change its argument in place, but only reads from it and creates a new container with the outliers removed. More about that in a moment.

The body of remove_outliers starts with a using directive for the std namespace. I prefer to write using declarations for the things that I need (min_element, max_element and remove_if), or use them with their fully qualified names. It will save me from the compiler choosing an overload that I didnâ€™t intend to be used, now or in the future when compiling with a new version of the library. Note that the using directive is already useless if we want to use std::min or std::max, because that conflicts with the local variables min and max.

In the next two statements, std::min_element and std::max_element are used to determine the minimum and maximum value. This means going through the container twice. We can reduce this to once by using std::minmax_element. The result of std::minmax_element is a pair of const_iterators and it would be nice to use a structured binding to capture the result. Unfortunately, we cannot use the names introduced by a structured binding in a lambda capture. To circumvent this problem we use std::tie (and include the header <tuple>) with previously declared min and max.

The rest of the function is a single statement that removes the minimum and maximum values in place. As mentioned before, it would be nice if remove_outliers and calc could take their argument as const auto&. We can achieve this by copying all elements we want to keep (using std::copy_if) into a new container and return that from the function. Putting it all together, remove_outliers now becomes

  template <typename Container>
  Container remove_outliers(
    const Container& container)
  {
    typename Container::const_iterator min, max;
    std::tie(min, max) =
      std::minmax_element(container.begin(),
        container.end());
    Container result{};
    std::copy_if(container.begin(),
      container.end(),
      std::back_inserter(result),
      [min, max](const auto& value) {
        return value != *min && value != *max;
      });
    return result;
  }

Thereâ€™s not much to tell about get_sums that isnâ€™t told already, so letâ€™s move on to calc. In the calculations of mean and sd we see repetitive use of sums.first and sums.second. This doesnâ€™t tell us much. If we replace sums.first by sum and sums.second by sumsq, it becomes more informative. To do so, we have to put the result of get_sums in a structured binding [sum, sumsq].

With a template type of std::vector<int>, as is the case in statistics.cpp, sum and sumsq have the type int. The type of n is the implementation defined std::vector<int>::size_type, which in most, if not all, implementations is either unsigned int or unsigned long. In either case the usual arithmetic conversions apply and sum and sumsq are converted from a signed integral type to an unsigned integral type, changing the value of sum if it was negative.

In the calculation of mean, the division is performed on the unsigned operands, after which the result is converted to double. For a proper calculation, the operands have to be converted to double before the division. In the calculation of sd we have a similar issue. One way to solve this is to declare n of type double (alternatively, have get_sums return a pair of doubles). Implicit conversions will take care of the rest. This solves part of the problems. In the calculation of sd the parentheses around n â€“ 1 are missing. Also, because we include <cmath>, we should use std::sqrt instead of sqrt.

With all these changes, calc now becomes

  template <typename Container>
  auto calc(const Container& container)
  {
    auto pruned = remove_outliers(container);
    auto [sum, sumsq] = get_sums(pruned);
    double n = pruned.size();
    double mean = sum / n;
    double sd = std::sqrt((sumsq - 
      sum * sum / n) / (n - 1));
    return std::make_pair(mean, sd);
  }

That covers the technical aspects of the header. We can also make some improvements in the implementation file statistics.cpp. The function read hides its use of std::cin in its internals and fills the vector it gets as argument with values. An interface that better states readâ€™s purpose is

  std::vector<int> read(std::istream& is)

It doesnâ€™t hide any input stream in its internals and it is clear that it produces a vector of ints. Changing read and main is left to the reader.

In main, again, we can use a structured binding [mean, sd] instead of result, improving readability.

Finally, from a functional point we can question the validity of remove_outliers. If we have measurements consisting of 10 times the value 1, 10 times the value 6 and only 4 values in the range 2 to 5, can we really say that all values 1 and 6 are outliers? Or if we have measurements with 100 values in the range 10 to 50, a single value 1 and a single value 2, why is 1 an outlier and 2 not? To answer those questions we need to know more about the data and how it was measured.

Also the code is not robust. calc will go wrong when we have no value or only one value after removing outliers, and read will stop processing input when it sees anything that is not an integer or whitespace.

Balog PÃ¡l

First thing I checked the claim. ☺ Excel indeed provided STDEV 2.16 for the numbers 2...8 and even provided the formula.

Reading the code, I see plenty of problems with efficiency, style and error handling, but for the input in the example nothing thatâ€™s a show-stopper. That leaves the result calculation as the main suspect. The formula I get starts with n* but that can be rearranged by simplifying with /n. After that it looks almost fine, except we need to divide by n-1. And here we divide by n then do a -1. That part should look like / (n-1). Letâ€™s recalculate between the two: 1.73205, square, +1, *n (that is 7), /6, sqrt: 2.1602 that matches the excel figure and what OP expected.

With this small change the code â€˜worksâ€™ for the test input, so weâ€™re done looking at the other things mentioned in the intro.

remove_outliers starts fetching min and max with 2 calls to an std:: algorithm. That is much better than the still usual approach of writing some loop and doing it manually. However, it could be improved by doing it in a single step with minmax_element. Could even use the parallel version.

Then it runs ahead erasing what we found, before making sure we did find something. Iâ€™d bet an overlook and not OPâ€™s careful consideration that it will actually work with empty vector never calling into the lambda. I base the bet on that I have no good idea what is really expected from this function to do in the first place. Currently it does remove all instances of min and max, so a data set with seven 1s and four 2s will be emptied. Maybe it intended just remove one instance of those numbers. Maybe it should not even do anything if we have just 2 or less data points. On my review this would not pass without having proper comments on what the intention is.

get_sums is poorly named, but at least it is evident that it calculates the sum of all elements and their squares. It takes the input by copy for no good reason. While the for loop is pretty straightforward, if we used an algorithm earlier why not do that again, accumulate should be up to this task. And then we could also run it parallelized without effort.

One problem to look for here is overflow. We add up numbers in the vector in the same type. Even the simple sum is suspect to not fit, let alone the sum of squares. The caller of the function uses double for calculation regardless what is in the vector, having sum and sumsq be double instead of element_type, or provided by the caller would be more sensible.

Both in this for and the previous lambda OP used auto. I prefer auto const& as a baseline, especially when we donâ€™t really know what the type is. In this particular case it is likely some numeric type and we only use the value, still just an extra point of fragility.

calc also takes the input by value, though it at least has a reason: it needs a copy that will be stripped of the outliers. Here we run into disaster if the input was empty or became such after stripping as we divide by n.

mean looks fine as written and provides the desired value in the â€˜testâ€™ too, but probably not what we meant to do really: its type is double but the calculation will be done with types int and size_t, losing any fraction. And losing the sign too. I decided to go paper-only with this, those with a compiler at hand may try some runs with a few negative numbers. That would sort-of go away if we used double in sums as was suggested earlier. I say sort-of because the user of doubles should be aware of the precision limit and other quirks attached to floating point types.

For readability sums.first and second are clearly suboptimal, with structured bindings already in C++ we can better say auto const [sum, sumsq] = â€¦ .

Also the naked call sqrt() is suspect. We included <cmath> so it should be std::sqrt() really. We might get away with it, but AFAIK it may not find the function unless we include <math.h>.

In the implementation file we have read that is void and has an _Out_ parameter. Thatâ€™s very last century, in most cases itâ€™s better to return the result and leave the T& only for _InOut_ params. That relieves the headache about what to do if the incoming v is not empty and the caller can make the result const. Also it is not nice to have std::cin hard-wired here, Iâ€™d make that passed in from above.

In main again we could use structured bindings. Also OP should learn the keyword const and use it wherever possible. In the current state statistics.h has no reason to exist, but if it is, should be self-contained with its includes (uses plenty of std::) and multi-inclusion-protected.

That leaves us with one more elephant: the statistics namespace. At this state it has no reason whatsoever to exist. With a single name that belongs there (and 2 more sneaked in). And that name is as awful as it gets. If it is meant as a â€˜componentâ€™, then design it properly, a clear public interface and all the rest hidden from sight. Are the templates needed at all? If so, all of them? And all templated on the collection, instead of, say, iterators?

Finally, a few points that some might point out but that I donâ€™t consider to be issues:

no endl or flush of cout: thatâ€™s supposedly mandated to happen on exit
auto function returns and variables: some people prefer to spell out the type for {reasons} that I donâ€™t find very strong in a work environment.

Marcel MarrÃ© and Jan Eisenhauer

The main issue with CCC 117â€™s code is a misunderstanding of Câ€™s and C++â€™s data type conversion and calculation rules. For both mean and sd, the code in calc employs integer calculations (up to the sqrt in the case of sd). Thus, mean would only ever take integer values, and sd would only ever take the square root of an integer (and the result in the example run is the square root of three).

There are multiple ways to deal with this. get_sums could return a pair<double,double>, explicit double-casts could be added to the code or the line auto n = v.size() could be changed to double n = v.size(). We chose the third option, since it is small and local, makes all denominators doubles and thus forces double divisions.

In addition, the term n-1 in the calculation of the standard deviation requires brackets since otherwise there is no division by n-1 but by n followed by subtraction of 1, followed by the calculation of the square root.

The function remove_outliers also deserves a close look. First off, a semantic question is whether, as the code does, all values equal to the largest and smallest element should be removed from the container. In the case of â€˜pathologicâ€™ samples in which either the minimum or maximum value appears often, the results could be skewed markedly. If only one minimum and maximum should be removed, we do not need remove_if to delete the items:

  auto min = min_element(v.begin(), v.end());
  v.erase(min); 
  auto max = max_element(v.begin(), v.end());
  v.erase(max);

Note that iterators become potentially invalid when the container is changed, so one should not, in this case, get both iterators and then erase each, since this can lead to a segmentation fault, or incorrect results.

If the implemented semantics are correct, the use of iterators within a function changing the container is also problematic. Consider the following two runs, which should lead to the same results, but do not (with the type problems in calc already fixed):

  $ echo 1 1 2 3 4 5 6 8 7 9 9 | statistics
  mean: 5, sd: 2.16025

  $ echo 1 2 3 1 4 5 9 6 8 7 9 | statistics
  mean: 5, sd: 2.73861

Rather than storing the iterator and dereferencing them each time in the lambda, it is better to store the value of the minimum and maximum rather than their position in the container. For better readability, considering v is already used in the outer function, we also suggest renaming the parameter of the lambda to e.g. item.

We would also suggest not using namespace std in remove_outliers, but rather using std::min_element; and using std::max_element. This makes accidental use of identifiers from the std namespace less likely, while still having the advantage (over fully qualified usage) of allowing iterator-specific min_element and max_element functions to be found by argument dependent lookup.

Instead of using min_element and max_element individually, minmax_element will only walk the container once to get iterators to both extrema.

Looking beyond remove_outliers, the whole code is rather fragile, since it does not protect against empty containers either before or after the run of remove_outliers. Even the case of a container with one element (after remove_outliers) will lead to a division by zero in calc (although on g++ 7.4.0 without specific options this leads to nan rather than an exception).

However, the call echo | statistics does lead to a segmentation fault and dumped core. It is a matter of design who should be responsible for relevant checks. As it is, remove_outliers and get_sums are really implementation details for calc, so it might be natural to make the checks in calc.

It is generally advisable for a header to include those headers it requires, whereas in the presented code the source file using the header includes all required headers:

Including the .h file as the very first line of the .c file ensures that no critical piece of information intrinsic to the physical interface of the component is missing from the .h file [â€¦].

Large-scale C++ software design by John Lakos, ISBN 0-201-63362-0, page 111

get_sums could also be replaced with std::accumulate and std::inner_product, although in this case going through the container once in get_sums rather than once in each of the standard algorithms may yield performance benefits.

If the semantically correct choice for remove_outliers is the deletion of only one maximum and minimum each, the code could even be restructured to handle the numbers from std::cin incrementally without storing them in a vector at all, thus allowing huge datasets to be handled with a very moderate (and constant) memory footprint. A detailed description how this could be done is beyond this code critique, especially as it isnâ€™t clear what remove_outliers should really do.

Commentary

I think we had great coverage from our critiques this time so thereâ€™s really very little left to say.

It would, in practice, be important to know more about the use to which this program would be put: what sort of size are the intended data sets and how many times is the program required to be executed. The answers to these questions would help to decide how much emphasis to place on raw performance and on memory reduction.

In my experience, C++ programmers can over-emphasise performance over clarity; although sometimes, of course, you can get both â€“ as several critiques pointed out by suggesting minmax_element.

It was good that the original code had actually been tried out on a known data set (and an error detected), although no-one explicitly suggested ways to improve the test set.

The oddity in the program is remove_outliers â€“ this is a good case where I think in a code review youâ€™d ask for a comment explaining the purpose of the function and, hopefully, providing a link to some broader explanation of the algorithm; without this it is hard to know whether the code is doing the job it is supposed to.

The winner of CC 117

All the entrants identified the two main problems with the code presented. It can be hard to spot this sort of problem in code since the C++ code looks very like the mathematical formula being used and this familiarity can blind us to the errors.

Marcel and Jan also pointed out another, more subtle, error with using the dereferenced values *min and *max in remove_outliers.

Various critiques suggested additional standard algorithms that could be used and also pointed out the sub-optimal use of value and reference semantics in the function calls.

Pal independently verified the claim about the anticipated value from the test dataset, which was a good idea.

It was hard to choose between the critiques, but on reflection I decided to award the prize for this issue to Hansâ€™ critique. Thank you to all who entered!

Code Critique 118

(Submissions to scc@accu.org by August 1st)

Iâ€™m writing code to process sets of address lists, and was hoping to avoid having to make copies of the data structures by using references. However, I canâ€™t get it working â€“ Iâ€™ve produced a stripped down example that reads sample data from a file but it doesnâ€™t work at all.

Sample data:

     Roger Peckham London UK
     John Beaconsfield Bucks UK
     Michael Chicago Illinois USA

Expected output:

addresses, sorted by country and then county

Actual output:

three blank lines!

Listing 3 contains the code.Can you solve the problem â€“ and then give some further advice?

#include <algorithm>
#include <iostream>
#include <iterator>
#include <map>
#include <string>
#include <vector>
struct address /* simplified */
{
  std::string city;
  std::string county;
  std::string country;
};
std::istream& operator>>(std::istream& is,
  address& t)
{
  is >> t.city >> t.county >> t.country;
  return is;
}
std::ostream& operator<<(std::ostream& os,
  address& t)
{
  os << t.city << ' '<< t.county << ' '
     << t.country;
  return os;
}

Listing 3

// sort addresses by country, county, and
// then city
bool operator<(address const &a,
               address const & b)
{
  if (a.country < b.country) return true;
  if (a.country == b.country &&
      a.county < b.county) return true;
  if (a.country == b.country &&
      a.county == b.county &&
      a.city < b.city) return true;
  return false;
}

// This is just for testing, real data
// is supplied differently
auto read(std::istream &is)
{
  std::map<std::string, address> ret;
  while (is)
  {
    std::string name;
    address addr;
    if (is >> name >> addr)
    {
      ret[name] = addr;   
    }
  }
  return ret;
}
  
int main()
{
  std::map<std::string, address> map;
  map = read(std::cin);
  
  // Don't copy the addresses
  std::vector<std::tuple<address&>> addrs;
  for (auto entry : map)
  {
    addrs.push_back(entry.second);
  }

  // sort and print the addresses
  std::sort(addrs.begin(), addrs.end());
  for (auto& v : addrs)
  {
    std::cout << std::get<0>(v) << '\n';
  }
  std::cout << '\n';
}

Listing 4

You can also get the current problem from the accu-general mail list (next entry is posted around the last issueâ€™s deadline) or from the ACCU website (http://accu.org/index.php/journal). This particularly helps overseas members who typically get the magazine much later than members in the UK and Europe.

Roger Orr Roger has been programming for over 20 years, most recently in C++ and Java for various investment banks in Canary Wharf and the City. He joined ACCU in 1999 and the BSI C++ panel in 2002.

Notes:

More fields may be available via dynamicdata ..