    <rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:admin="http://webns.net/mvcb/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:content="http://purl.org/rss/1.0/modules/content/">
     <channel>
        <title>ACCU  :: Code Critique Competition 118</title>
        <link>https://members.accu.org/index.php/articles/2671</link>
        <description>Professionalism in Programming</description>
        <dc:language>en-us</dc:language> 
        <dc:creator>Administrator</dc:creator> 
        <admin:generatorAgent rdf:resource="http://www.xaraya.org" /> 
        <admin:errorReportsTo rdf:resource="mailto:webeditor@accu.org" />
       <sy:updatePeriod>hourly</sy:updatePeriod>
       <sy:updateFrequency>1</sy:updateFrequency>
       <docs>http://backend.userland.com/rss</docs>




<div class="xar-mod-head"><span class="xar-mod-title">Student Code Critiques from CVu journal. + CVu Journal Vol 31, #3 - July 2019</span></div>

<table border="0" cellpadding="1" cellspacing="0">
    <tbody>
    <tr>
        <td valign="top">
            Browse in :
       </td>
       <td valign="top">

                                            <a href="https://members.accu.org/index.php/articles/">All</a>

                     &gt;                         <a href="https://members.accu.org/index.php/articles/c184/">Journal Columns</a>

                     &gt;                         <a href="https://members.accu.org/index.php/articles/c183/">Code Critique</a>
<br />

                                            <a href="https://members.accu.org/index.php/articles/">All</a>

                     &gt;                         <a href="https://members.accu.org/index.php/articles/c76/">Journals</a>

                     &gt;                         <a href="https://members.accu.org/index.php/articles/c77/">CVu</a>

                     &gt;                         <a href="https://members.accu.org/index.php/articles/c400/">313</a>
<br />

                                            <a href="https://members.accu.org/index.php/articles/c183-400/">Any of these categories</a>

                    -                        <a href="https://members.accu.org/index.php/articles/c183+400/">All of these categories</a>
<br />
</td>
   </tr>
   </tbody>
</table>




<div class="xar-error">
   <p>
 <strong>Note:</strong> when you create a new publication type,
the articles module will automatically use the templates
<em>user-display-[publicationtype].xt</em>
and <em>user-summary-[publicationtype].xt</em>.
If those templates do not exist when you try to preview or display a new article,
you'll get this warning :-)  Please place your own templates in themes/<em>yourtheme</em>/modules/articles . The templates will get the extension .xt there. </p>
</div>
<div class="xar-norm xar-standard-box-padding">
   <h1><strong>Title:</strong>&nbsp;Code Critique Competition 118</h1>
<p><strong>Author:</strong>&nbsp;Bob Schmidt</p>
<p>
<strong>Date:</strong> 05 July 2019 00:55:29 +01:00 or Fri, 05 July 2019 00:55:29 +01:00</p>
<p><strong>Summary:</strong>&nbsp;Set and collated by Roger Orr. A book prize is awarded for the best entry.</p>
<p><strong>Body:</strong>&nbsp;<p class="EditorIntro">Please note that participation in this competition is open to all members, whether novice or expert. Readers are also encouraged to comment on published entries, and to supply their own possible code samples for the competition (in any common programming language) to scc@accu.org.</p>

<p class="EditorIntro">Note: if you would rather not have your critique visible online please inform me. (Email addresses are not publicly visible.)</p>

<h2>Last issueâ€™s code</h2>

<p class="blockquote">Iâ€™m trying to do some simple statistics for some experimental data but I seem to have got an error: I donâ€™t <em>quite</em> get the results I am expecting for the standard deviation. I canâ€™t see whatâ€™s wrong: Iâ€™ve checked and I donâ€™t get any compiler warnings. I ran a test with a simple data set as follows:</p>

<pre class="programlisting">
     echo 1 2 3 4 5 6 7 8 9 | statistics
     mean: 5, sd: 1.73205</pre>

<p class="blockquote">The mean is right but I think I should be getting a standard deviation of something like 2.16 for the set (without outliers this is [2,8].)</p>

<p>Can you solve the problem â€“ and then give some further advice?</p>

<p>Listing 1 contains  <span class="filename">statistics.h</span> and Listing 2 is <span class="filename">statistics.cpp</span>.</p>

<table class="sidebartable">
	<tr>
		<td>
			<pre class="programlisting">
namespace statistics
{
  // get rid of the biggest and smallest
  template &lt;typename T&gt;
  void remove_outliers(T&amp; v)
  {
    using namespace std;
    auto min = min_element(v.begin(), v.end());
    auto max = max_element(v.begin(), v.end());
    v.erase(remove_if(v.begin(), v.end(),
      [min, max](auto v) {
        return v == *min || v == *max; 
      }),
      v.end());
  }
  template &lt;typename T&gt;
  auto get_sums(T v)
  {
    typename T::value_type sum{}, sumsq{};
    for (auto i : v)
    {
      sum += i;
      sumsq += i * i;
    }
    return std::make_pair(sum, sumsq);
  }
  template &lt;typename T&gt;
  auto calc(T v)
  {
    remove_outliers(v);
    auto sums = get_sums(v);
    auto n = v.size();
    double mean = sums.first / n;
    double sd = sqrt((sums.second -
      sums.first * sums.first / n) / n - 1);
    return std::make_pair(mean, sd);
  }
}
			</pre>
		</td>
	</tr>
	<tr>
		<td class="title">Listing 1</td>
	</tr>
</table>

<table class="sidebartable">
	<tr>
		<td>
			<pre class="programlisting">
#include &lt;algorithm&gt;
#include &lt;cmath&gt;
#include &lt;iostream&gt;
#include &lt;iterator&gt;
#include &lt;vector&gt;
#include &quot;statistics.h&quot;
void read(std::vector&lt;int&gt;&amp; v)
{
  using iter = std::istream_iterator&lt;int&gt;;
  std::copy(iter(std::cin), iter(),
    std::back_inserter(v));
}
int main()
{
  std::vector&lt;int&gt; v;
  read(v);
  auto result = statistics::calc(v);
  std::cout &lt;&lt; &quot;mean: &quot; &lt;&lt; result.first
    &lt;&lt; &quot;, sd: &quot; &lt;&lt; result.second &lt;&lt; '\n';
}
			</pre>
		</td>
	</tr>
	<tr>
		<td class="title">Listing 2</td>
	</tr>
</table>

<h2>Critique</h2>

<h3>Dave Simmonds</h3>

<p><code>remove_outliers</code> will run down the container 3 times, which could be costly on a large dataset â€“ weâ€™ll come back to this later.</p>

<p><code>remove_outliers</code> actually removes the outliers from its input. That sounds OK, and looking at the way <code>calc</code> takes a copy, it works here. But we could be more efficient by taking a const ref and not changing the input. More later.</p>

<p>The lambda in <code>remove_outliers</code> reuses the variable <code>v</code>. This is OK but confusing. We could call it <code>e</code> for element.</p>

<p>In <code>calc</code>, the formula for the standard deviation is the version which means we only have to run down the container once. But as mentioned above weâ€™ve already gone down it three times in <code>remove_outliers</code>, so this could be better. We actually could run down the container only once, keeping track of and counting outliers and adding up <code>sum</code> and <code>sumsq</code>, then removing the amount for the outliers at the end. Code below.</p>

<p>Now for the issue that gives us the wrong result: the formula is trying to divide by n-1, but actually is dividing by n and then subtracting 1. So we need to change <code>n â€“ 1</code> to <code>(n â€“ 1)</code></p>

<p>And now another issue. All our calculations here are based on integers, we will get rounding down to nearest integer â€“ which will give us the answer 2 in the example. We need to convert to <code>double</code> before we do any division.</p>

<p>The line(s) in question becomes:</p>

<pre class="programlisting">
  double sd = sqrt((sums.second -
    (double)(sums.first * sums.first) / n)
    / (n - 1));</pre>
	
<p>This also applies to the mean calculation, which should be:</p>

<pre class="programlisting">
  double mean = (double)sum / n;</pre>
  
<p>The next issue is what if n is less than 2 â€“ we will divide by zero â€“ so there needs to be a check for that.</p>

<p>Hereâ€™s my version of <span class="filename">statistics.h</span>, which addresses all these issues:</p>

<pre class="programlisting">
  namespace statistics
  {
    template &lt;typename T&gt;
    auto calc(T const &amp; v)
    {
      typename T::value_type sum{}, sumsq{};
      typename T::value_type min{}, max{};
      int min_count{}, max_count{};
      // run down the container just once,
      // collecting what we need
      for (auto i : v)
      {
        if (!min_count || i &lt; min)
        {
          min = i;
          min_count = 1;
        }
        else if (i == min)
        {
          ++min_count;
        }
        if (!max_count || i &gt; max)
        {
          max = i;
          max_count = 1;
        }
        else if (i == max)
        {
          ++max_count;
        }
        sum += i;
        sumsq += i * i;
      }
      auto n = v.size();
      // now we remove the outliers
      if (min_count)
      {
        sum -= min*min_count;
        sumsq -= min*min*min_count;
        n -= min_count;
      }
      if (max_count &amp;&amp; max != min) // check if
      // max == min, if so we have already
      // removed it above
      {
        sum -= max*max_count;
        sumsq -= max*max*max_count;
        n -= max_count;
      }
      // check we have enough input
      if (n &lt; 2)
      {
        throw(std::runtime_error(
            &quot;not enough data&quot;));
      }
      // and do the final calcs
      double mean = (double)sum / n;
      double sd = sqrt((sumsq -
        (double)(sum * sum) / n) / (n - 1));
      return std::make_pair(mean, sd);
    }
  }</pre>

<h3>James Holland</h3>

<p>As the student mentions, the compiler issues no warnings. This suggests that there is something wrong with the program logic. It turns out that the line of code that calculates the standard deviation, <code>sd</code>, is in error. Replacing the statement with the following code results in the correct answer for the input provided.</p>

<pre class="programlisting">
  double variance = 0.0;
  for (auto x : v)
  {
    variance += std::pow(x - mean, 2);
  }
  variance /= n - 1;
  double sd = sqrt(variance);</pre>
  
<p>Unfortunately, the program is still not correct as demonstrated by entering the test data again but omitting the first number, (1). Removing the outliers leaves [3 â€¦ 8] which has a mean of 5.5 and a standard deviation of 1.87083. The program gives a mean of 5 and a standard deviation of 1.94936. Something is clearly still wrong. It turns out that the mean is being calculated incorrectly. It is, in effect, being rounded down to the nearest integer. This is because the sum, an integer, is being divided by the number of samples, another integer. The result is also an integer. The integer is then converted to a <code>double</code> and used to initialise <code>mean</code>. But it is too late; the fractional part of the mean has already been lost. The reason why the correct answer was produced with the original test data is that the mean just happened to be an integer value. To correct the problem I suggest casting <code>sums.first</code> to a <code>double</code> before dividing by the number of samples. The program should now work correctly with any input.</p>

<p>The general advice one receives is not to hand-craft loops oneself but to make use of algorithms already provided by the C++ standard library. The original code already uses some library function, such as <code>std::copy()</code>, but letâ€™s see what else the library has to offer.</p>

<p>Instead of using the two library functions <code>min_elenent()</code> and <code>max_element()</code>, the single function <code>minmax_element()</code> can be used with a slight gain in efficiency. This function returns a <code>std::pair </code>of iterators, the first pointing to an element with the smallest value and the second pointing to an element with the largest value. The <code>std::pair</code> of iterators can then be captured by the lambda of the <code>remove_if()</code> function and used by selecting the first and second elements of the pair.</p>

<p>It is possible to get away without actually erasing the â€˜removedâ€™ elements of the vector. We could modify <code>remove_outliers()</code> to return the logical end of the vector as returned by <code>remove_if()</code>. The logical end of the vector can be used in further processing instead of the actual end of the vector. The small amount of execution time gained may be beneficial in some circumstances.</p>

<p>We now turn are attention to <code>get_sums()</code>. This function contains a single loop that calculates the sum and the sum of the squares. While there is nothing disastrously wrong with the code, the loop could be replaced by two library functions as shown in the listing below. Whether using the library makes the code more readable is debatable, however. At least the number of lines of code in the <code>get_sums()</code> function has been reduced from seven to three.</p>

<p>It is also possible to make use of the standard library in calculating the standard deviation, <code>sd</code>. Instead of employing the code shown at the beginning of this text, the standard library function <code>accumulate()</code> can be used to calculate the square of the standard deviation. This value is called the variance. Obtaining the square root of the variance, therefore, gives the standard deviation as shown in the complete listing of my version of the <span class="filename">statistics.h</span>.</p>

<pre class="programlisting">
  #include &lt;algorithm&gt;
  #include &lt;iostream&gt;
  #include &lt;numeric&gt;
  namespace statistics
  {
    // Get rid of the biggest and smallest.
    template &lt;typename T&gt;
    auto remove_outliers(T &amp; v)
    {
      const auto min_max =
        std::minmax_element(v.begin(), v.end());
      return std::remove_if(v.begin(), v.end(),
       [min_max](auto element) {
       return element == *min_max.first ||
         element == *min_max.second;});
    }
    template &lt;typename T&gt;
    auto get_sums(const T &amp; v,
      typename T::const_iterator logical_end)
    {
      const auto sum =
        std::accumulate(v.begin(), logical_end,
          typename T::value_type{});
      const auto sumsq =
        std::inner_product(v.begin(),
        logical_end, v.begin(),
        typename T::value_type{});
      return std::make_pair(sum, sumsq);
    }
    template &lt;typename T&gt;
    auto calc(T &amp; v)
    {
      const auto logical_end =
        remove_outliers(v);
      const auto sums =
        get_sums(v, logical_end);
      const auto n =
        distance(v.begin(), logical_end);
      const auto mean = 
        static_cast&lt;double&gt;(sums.first) / n;
      const auto variance =
        std::accumulate(v.begin(), logical_end,
        0.0, [&amp; mean](auto a, auto b){
          return a + std::pow(b - mean, 2);})
          / (n - 1);
      const auto sd = sqrt(variance);
      return std::make_pair(mean, sd);
    }
  }</pre>
  
<p>Incidentally, when I was experimenting with various ways of rewriting <code>remove_outliers()</code>, I came up with solutions such as that shown below.</p>

<pre class="programlisting">
  std::vector&lt;int&gt; v;
  const auto [min, max] =
    std::minmax_element(v.begin(), v.end()); 
  std::remove_if(v.begin(), v.end(),
    [min, max](auto element) {
      return element == *min ||
        element == *max;});</pre>
		
<p>This code would not compile with Clang although the following would.</p>

<pre class="programlisting">
  std::vector&lt;int&gt; v;
  const auto [minimum, maximum] =
    std::minmax_element(v.begin(), v.end());
  const auto min = minimum;
  const auto max = maximum;
  std::remove_if(v.begin(), v.end(),
    [min, max](auto element) {
      return element == *min ||
        element == *max;});</pre>
		
<p>Both versions compiled without complaint with GCC and Microsoft. Does this represent a bug in the Clang compiler?</p>

<p class="EditorIntro">Editor: this was a troublesome issue where compilers were found to disagree in their interpretation of the wording in C++17. Initially p0588r1 in Nov 2017 made it illegal, pending further discussion; p1091r3 in Feb 2019 removed this restriction. However, these both only apply to the working paper until a new C++ standard is published.</p>

<h3>Hans Vredeveld</h3>

<p>The first thing we notice is that <span class="filename">statistics.h</span> misses an include guard and that it makes use of several standard library functions, but that it does not have any <code>#include</code>s. It should include <code>&lt;algorithm&gt;</code>, <code>&lt;utility&gt;</code> and <code>&lt;cmath&gt;</code>, so that it is self contained and can be used easily in other implementations than <span class="filename">statistics.cpp</span>. The include of &lt;cmath&gt; can then be removed from <span class="filename">statistics.cpp</span>. (<code>&lt;algorithm&gt;</code> still needs to be included because of the use of <code>std::copy</code>.) Note that an include of <code>&lt;utility&gt;</code> was missing altogether. In the implementations of the standard library that I have installed (GCC 7.4.1â€™s libstdc++, and LLVM 5.0.1â€™s libc++) <code>&lt;algorithm&gt;</code> includes <code>&lt;utility&gt;</code>, and I guess that in other implementations it will be similar. Still, it looked more like an implementation detail to me than anything else, so we should not rely on it. Putting the includes for the projectâ€™s own headers before the includes for third party and standard library headers would have caught at least part of these issues.</p>

<p>In the three function templates in the header there are two recurring issues. First, naming of identifiers. The template parameter in all three templates is <code>T</code>. This doesnâ€™t convey any information about what it represents. When reading the code, it becomes clear that it stands for a container type, so I suggest naming it <code>Container</code> instead. Similar, I suggest renaming the function parameter <code>v</code> to container in all three function templates. In <code>remove_outliers</code>, the lambdaâ€™s parameter <code>v</code> is different from, and shadows the function parameter <code>v</code>. In this case, <code>v</code> stands for the value of a container element, and I suggest renaming it to <code>value</code>.</p>

<p>The other recurring issue is that of taking things by value instead of by reference. The lambda in <code>remove_outliers</code> and the <code>for</code>-loop in <code>get_sums</code> both operate on elements of the container. In the generic case of templates, where we donâ€™t know anything about the element size, it is better to use const references to the elements. If the element size is small, this has no, or a small impact on the generated code (depending on the compiler and compiler options). If the element size is large, this avoids some expensive copying.</p>

<p><code>get_sums</code> and <code>calc</code> take a container as argument. In all but the most trivial cases this will be a large object that is expensive to copy. <code>get_sums</code> could easily take its argument as <code>const&amp;</code>. For <code>calc</code>, it is a bit more complicated. <code>calc</code> calls <code>remove_outliers</code>, which operates on the container in place. We could have <code>calc</code> take its argument as non-const reference, but then <code>calc</code> would have some unexpected side effects, where calling <code>calc</code> twice with the same container would give different results. We can improve on this by changing <code>remove_outliers</code> so that it doesnâ€™t change its argument in place, but only reads from it and creates a new container with the outliers removed. More about that in a moment.</p>

<p>The body of <code>remove_outliers</code> starts with a using directive for the <code>std</code> namespace. I prefer to write using declarations for the things that I need (<code>min_element</code>, <code>max_element</code> and <code>remove_if</code>), or use them with their fully qualified names. It will save me from the compiler choosing an overload that I didnâ€™t intend to be used, now or in the future when compiling with a new version of the library. Note that the <code>using</code> directive is already useless if we want to use <code>std::min</code> or <code>std::max</code>, because that conflicts with the local variables <code>min</code> and <code>max</code>.</p>

<p>In the next two statements, <code>std::min_element</code> and <code>std::max_element</code> are used to determine the minimum and maximum value. This means going through the container twice. We can reduce this to once by using <code>std::minmax_element</code>. The result of <code>std::minmax_element</code> is a pair of <code>const_iterator</code>s and it would be nice to use a structured binding to capture the result. Unfortunately, we cannot use the names introduced by a structured binding in a lambda capture. To circumvent this problem we use <code>std::tie</code> (and include the header <code>&lt;tuple&gt;</code>) with previously declared <code>min</code> and <code>max</code>.</p>

<p>The rest of the function is a single statement that removes the minimum and maximum values in place. As mentioned before, it would be nice if <code>remove_outliers</code> and <code>calc</code> could take their argument as <code>const auto&amp;</code>. We can achieve this by copying all elements we want to keep (using <code>std::copy_if</code>) into a new container and return that from the function. Putting it all together, <code>remove_outliers</code> now becomes</p>

<pre class="programlisting">
  template &lt;typename Container&gt;
  Container remove_outliers(
    const Container&amp; container)
  {
    typename Container::const_iterator min, max;
    std::tie(min, max) =
      std::minmax_element(container.begin(),
        container.end());
    Container result{};
    std::copy_if(container.begin(),
      container.end(),
      std::back_inserter(result),
      [min, max](const auto&amp; value) {
        return value != *min &amp;&amp; value != *max;
      });
    return result;
  }</pre>
  
<p>Thereâ€™s not much to tell about <code>get_sums</code> that isnâ€™t told already, so letâ€™s move on to calc. In the calculations of <code>mean</code> and <code>sd</code> we see repetitive use of <code>sums.first</code> and <code>sums.second</code>. This doesnâ€™t tell us much. If we replace <code>sums.first</code> by <code>sum</code> and <code>sums.second</code> by <code>sumsq</code>, it becomes more informative. To do so, we have to put the result of <code>get_sums</code> in a structured binding <code>[sum, sumsq]</code>.</p>

<p>With a template type of <code>std::vector&lt;int&gt;</code>, as is the case in <span class="filename">statistics.cpp</span>, <code>sum</code> and <code>sumsq</code> have the type <code>int</code>. The type of <code>n</code> is the implementation defined <code>std::vector&lt;int&gt;::size_type</code>, which in most, if not all, implementations is either <code>unsigned int</code> or <code>unsigned long</code>. In either case the usual arithmetic conversions apply and <code>sum</code> and <code>sumsq</code> are converted from a signed integral type to an unsigned integral type, changing the value of <code>sum</code> if it was negative.</p>

<p>In the calculation of <code>mean</code>, the division is performed on the unsigned operands, after which the result is converted to <code>double</code>. For a proper calculation, the operands have to be converted to <code>double</code> before the division. In the calculation of <code>sd</code> we have a similar issue. One way to solve this is to declare <code>n</code> of type <code>double</code> (alternatively, have <code>get_sums</code> return a pair of <code>double</code>s). Implicit conversions will take care of the rest. This solves part of the problems. In the calculation of <code>sd</code> the parentheses around n â€“ 1 are missing. Also, because we include <code>&lt;cmath&gt;</code>, we should use <code>std::sqrt</code> instead of <code>sqrt</code>.</p>

<p>With all these changes, <code>calc</code> now becomes</p>

<pre class="programlisting">
  template &lt;typename Container&gt;
  auto calc(const Container&amp; container)
  {
    auto pruned = remove_outliers(container);
    auto [sum, sumsq] = get_sums(pruned);
    double n = pruned.size();
    double mean = sum / n;
    double sd = std::sqrt((sumsq - 
      sum * sum / n) / (n - 1));
    return std::make_pair(mean, sd);
  }</pre>
  
<p>That covers the technical aspects of the header. We can also make some improvements in the implementation file <span class="filename">statistics.cpp</span>. The function read hides its use of <code>std::cin</code> in its internals and fills the vector it gets as argument with values. An interface that better states <code>read</code>â€™s purpose is</p>

<pre class="programlisting">
  std::vector&lt;int&gt; read(std::istream&amp; is)</pre>
  
<p>It doesnâ€™t hide any input stream in its internals and it is clear that it produces a vector of <code>int</code>s. Changing <code>read</code> and <code>main</code> is left to the reader.</p>

<p>In <code>main</code>, again, we can use a structured binding <code>[mean, sd]</code> instead of <code>result</code>, improving readability.</p>

<p>Finally, from a functional point we can question the validity of <code>remove_outliers</code>. If we have measurements consisting of 10 times the value 1, 10 times the value 6 and only 4 values in the range 2 to 5, can we really say that all values 1 and 6 are outliers? Or if we have measurements with 100 values in the range 10 to 50, a single value 1 and a single value 2, why is 1 an outlier and 2 not? To answer those questions we need to know more about the data and how it was measured.</p>

<p>Also the code is not robust. <code>calc</code> will go wrong when we have no value or only one value after removing outliers, and <code>read</code> will stop processing input when it sees anything that is not an integer or whitespace.</p>

<h3>Balog PÃ¡l </h3>

<p>First thing I checked the claim. &#x263A; Excel indeed provided STDEV 2.16 for the numbers 2...8 and even provided the formula. </p>

<p>Reading the code, I see plenty of problems with efficiency, style and error handling, but for the input in the example nothing thatâ€™s a show-stopper. That leaves the <code>result</code> calculation as the main suspect. The formula I get starts with <code>n*</code> but that can be rearranged by simplifying with <code>/n</code>. After that it looks almost fine, except we need to divide by<code> n-1</code>. And here we divide by <code>n</code> then do a <code>-1</code>. That part should look like <code>/ (n-1)</code>. Letâ€™s recalculate between the two: 1.73205, square, +1, *n (that is 7), /6, sqrt: 2.1602 that matches the excel figure and what OP expected.</p>

<p>With this small change the code â€˜worksâ€™ for the test input, so weâ€™re <span style="text-decoration:line-through;">done</span> looking at the other things mentioned in the intro.</p>
 
<p><code>remove_outliers</code> starts fetching <code>min</code> and <code>max</code> with 2 calls to an <code>std:: algorithm</code>. That is much better than the still usual approach of writing some loop and doing it manually. However, it could be improved by doing it in a single step with <code>minmax_element</code>. Could even use the parallel version.</p>

<p>Then it runs ahead erasing what we found, before making sure we did find something. Iâ€™d bet an overlook and not OPâ€™s careful consideration that it will actually work with empty vector never calling into the lambda. I base the bet on that I have no good idea what is really expected from this function to do in the first place. Currently it does remove all instances of <code>min</code> and <code>max</code>, so a data set with seven 1s and four 2s will be emptied. Maybe it intended just remove one instance of those numbers. Maybe it should not even do anything if we have just 2 or less data points.  On my review this would not pass without having proper comments on what the intention is.</p>

<p><code>get_sums</code> is poorly named, but at least it is evident that it calculates the sum of all elements and their squares. It takes the input by copy for no good reason. While the for loop is pretty straightforward, if we used an algorithm earlier why not do that again, <code>accumulate</code> should be up to this task. And then we could also run it parallelized without effort. </p>

<p>One problem to look for here is overflow. We add up numbers in the vector in the same type. Even the simple sum is suspect to not fit, let alone the sum of squares. The caller of the function uses <code>double</code> for calculation regardless what is in the vector, having <code>sum</code> and <code>sumsq</code> be <code>double</code> instead of <code>element_type</code>, or provided by the caller would be more sensible. </p>

<p>Both in this for and the previous lambda OP used <code>auto</code>. I prefer <code>auto const&amp;</code> as a baseline, especially when we donâ€™t really know what the type is. In this particular case it is likely some numeric type and we only use the value, still just an extra point of fragility.</p>

<p><code>calc</code> also takes the input by value, though it at least has a reason: it needs a copy that will be stripped of the outliers. Here we run into disaster if the input was empty or became such after stripping as we divide by <code>n</code>. </p>

<p><code>mean</code> looks fine as written and provides the desired value in the â€˜testâ€™ too, but probably not what we meant to do really: its type is <code>double</code> but the calculation will be done with types <code>int</code> and <code>size_t</code>, losing any fraction. And losing the sign too. I decided to go paper-only with this, those with a compiler at hand may try some runs with a few negative numbers. That would sort-of go away if we used <code>double</code> in sums as was suggested earlier. I say sort-of because the user of <code>double</code>s should be aware of the precision limit and other quirks attached to floating point types.</p>

<p>For readability <code>sums.first </code>and <code>second</code> are clearly suboptimal, with structured bindings already in C++ we can better say <code>auto const [sum, sumsq] = â€¦</code> .</p>

<p>Also the naked call <code>sqrt() </code>is suspect. We included <code>&lt;cmath&gt;</code> so it should be <code>std::sqrt() </code>really. We might get away with it, but AFAIK it may not find the function unless we include <code>&lt;math.h&gt;</code>.</p>

<p>In the implementation file we have <code>read</code> that is <code>void</code> and has an _Out_ parameter. Thatâ€™s very last century, in most cases itâ€™s better to <code>return</code> the result and leave the <code>T&amp;</code> only for _InOut_ params. That relieves the headache about what to do if the incoming <code>v</code> is not empty and the caller can make the result <code>const</code>. Also it is not nice to have <code>std::cin</code> hard-wired here, Iâ€™d make that passed in from above.</p>

<p>In <code>main</code> again we could use structured bindings. Also OP should learn the keyword <code>const</code> and use it wherever possible. In the current state <span class="filename">statistics.h</span> has no reason to exist, but if it is, should be self-contained with its includes (uses plenty of <code>std::</code>) and multi-inclusion-protected.</p>

<p>That leaves us with one more elephant: the <code>statistics</code> namespace. At this state it has no reason whatsoever to exist. With a single name that belongs there (and 2 more sneaked in). And that name is as awful as it gets.  If it is meant as a â€˜componentâ€™, then design it properly, a clear public interface and all the rest hidden from sight.   Are the templates needed at all? If so, all of them? And all templated on the collection, instead of, say, iterators?</p>

<p>Finally, a few points that some might point out but that I donâ€™t consider to be issues:</p>

<ul>
	<li>no <code>endl</code> or <code>flush</code> of <code>cout</code>: thatâ€™s supposedly mandated to happen on exit</li>
	<li><code>auto</code> function returns and variables: some people prefer to spell out the type for {reasons} that I donâ€™t find very strong in a work environment.</li>
</ul>

<h3>Marcel MarrÃ©  and Jan Eisenhauer</h3>

<p>The main issue with CCC 117â€™s code is a misunderstanding of Câ€™s and C++â€™s data type conversion and calculation rules. For both <code>mean</code> and <code>sd</code>, the code in <code>calc</code> employs integer calculations (up to the <code>sqrt</code> in the case of <code>sd</code>). Thus, <code>mean</code> would only ever take integer values, and <code>sd</code> would only ever take the square root of an integer (and the result in the example run is the square root of three).</p>

<p>There are multiple ways to deal with this. <code>get_sums</code> could return a <code>pair&lt;double,double&gt;</code>, explicit <code>double</code>-casts could be added to the code or the line <code>auto n = v.size()</code> could be changed to <code>double n = v.size()</code>. We chose the third option, since it is small and local, makes all denominators <code>double</code>s and thus forces <code>double</code> divisions.</p>

<p>In addition, the term <code>n-1</code> in the calculation of the standard deviation requires brackets since otherwise there is no division by <code>n-1</code> but by <code>n</code> followed by subtraction of <code>1</code>, followed by the calculation of the square root.</p>

<p>The function <code>remove_outliers</code> also deserves a close look. First off, a semantic question is whether, as the code does, all values equal to the largest and smallest element should be removed from the container. In the case of â€˜pathologicâ€™ samples in which either the minimum or maximum value appears often, the results could be skewed markedly. If only one minimum and maximum should be removed, we do not need <code>remove_if</code> to delete the items:</p>

<pre class="programlisting">
  auto min = min_element(v.begin(), v.end());
  v.erase(min); 
  auto max = max_element(v.begin(), v.end());
  v.erase(max);</pre>
  
<p>Note that iterators become potentially invalid when the container is changed, so one should not, in this case, get both iterators and then erase each, since this can lead to a segmentation fault, or incorrect results.</p>

<p>If the implemented semantics are correct, the use of iterators within a function changing the container is also problematic. Consider the following two runs, which should lead to the same results, but do not (with the type problems in <code>calc</code> already fixed):</p>

<pre class="programlisting">
  $ echo 1 1 2 3 4 5 6 8 7 9 9 | statistics
  mean: 5, sd: 2.16025

  $ echo 1 2 3 1 4 5 9 6 8 7 9 | statistics
  mean: 5, sd: 2.73861</pre>
  
<p>Rather than storing the iterator and dereferencing them each time in the lambda, it is better to store the value of the minimum and maximum rather than their position in the container. For better readability, considering <code>v</code> is already used in the outer function, we also suggest renaming the parameter of the lambda to e.g. <code>item</code>.</p>

<p>We would also suggest not <code>using namespace std</code> in <code>remove_outliers</code>, but rather <code>using std::min_element</code>; and <code>using std::max_element</code>. This makes accidental use of identifiers from the <code>std</code> namespace less likely, while still having the advantage (over fully qualified usage) of allowing iterator-specific <code>min_element</code> and <code>max_element</code> functions to be found by argument dependent lookup.</p>

<p>Instead of using <code>min_element</code> and <code>max_element</code> individually, <code>minmax_element</code> will only walk the container once to get iterators to both extrema.</p>
<p>Looking beyond <code>remove_outliers</code>, the whole code is rather fragile, since it does not protect against empty containers either before or after the run of <code>remove_outliers</code>. Even the case of a container with one element (after <code>remove_outliers</code>) will lead to a division by zero in <code>calc</code> (although on g++ 7.4.0 without specific options this leads to nan rather than an exception).</p>

<p>However, the call <code>echo | statistics</code> does lead to a segmentation fault and dumped core. It is a matter of design who should be responsible for relevant checks. As it is, <code>remove_outliers</code> and <code>get_sums</code> are really implementation details for <code>calc</code>, so it might be natural to make the checks in <code>calc</code>.</p>

<p>It is generally advisable for a header to include those headers it requires, whereas in the presented code the source file using the header includes all required headers:</p>

<p class="blockquote">Including the <span class="filename">.h</span> file as the very first line of the <span class="filename">.c</span> file ensures that no critical piece of information intrinsic to the physical interface of the component is missing from the <span class="filename">.h</span> file [â€¦].</p>

<p class="blockquote"><em>Large-scale C++ software design</em> by John Lakos, ISBN 0-201-63362-0, page 111</p>

<p><code>get_sums</code> could also be replaced with <code>std::accumulate</code> and <code>std::inner_product</code>, although in this case going through the container once in <code>get_sums</code> rather than once in each of the standard algorithms may yield performance benefits.</p>

<p>If the semantically correct choice for <code>remove_outliers</code> is the deletion of only one maximum and minimum each, the code could even be restructured to handle the numbers from <code>std::cin</code> incrementally without storing them in a vector at all, thus allowing huge datasets to be handled with a very moderate (and constant) memory footprint. A detailed description how this could be done is beyond this code critique, especially as it isnâ€™t clear what <code>remove_outliers</code> should really do. </p>

<h2>Commentary</h2>

<p>I think we had great coverage from our critiques this time so thereâ€™s really very little left to say.</p>

<p>It would, in practice, be important to know more about the use to which this program would be put: what sort of size are the intended data sets and how many times is the program required to be executed. The answers to these questions would help to decide how much emphasis to place on raw performance and on memory reduction.</p>

<p>In my experience, C++ programmers can over-emphasise performance over clarity; although sometimes, of course, you can get both â€“ as several critiques pointed out by suggesting <code>minmax_element</code>.</p>

<p>It was good that the original code had actually been tried out on a known data set (and an error detected), although no-one explicitly suggested ways to improve the test set.</p>

<p>The oddity in the program is <code>remove_outliers</code> â€“ this is a good case where I think in a code review youâ€™d ask for a comment explaining the purpose of the function and, hopefully, providing a link to some broader explanation of the algorithm; without this it is hard to know whether the code is doing the job it is supposed to.</p>

<h2>The winner of CC 117</h2>

<p>All the entrants identified the two main problems with the code presented. It can be hard to spot this sort of problem in code since the C++ code looks very like the mathematical formula being used and this familiarity can blind us to the errors.</p>

<p>Marcel and Jan also pointed out another, more subtle, error with using the dereferenced values <code>*min</code> and <code>*max</code> in <code>remove_outliers</code>. </p>

<p>Various critiques suggested additional standard algorithms that could be used and also pointed out the sub-optimal use of value and reference semantics in the function calls.</p>

<p>Pal independently verified the claim about the anticipated value from the test dataset, which was a good idea.</p>

<p>It was hard to choose between the critiques, but on reflection I decided to award the prize for this issue to Hansâ€™ critique. Thank you to all who entered!</p>

<h2>Code Critique 118</h2>

<p>(Submissions to scc@accu.org by August 1st)</p>

<p class="blockquote">Iâ€™m writing code to process sets of address lists, and was hoping to avoid having to make copies of the data structures by using references. However, I canâ€™t get it working â€“ Iâ€™ve produced a stripped down example that reads sample data from a file but it doesnâ€™t work at all.</p>

<p class="blockquote">Sample data:</p>

<pre class="programlisting">
     Roger Peckham London UK
     John Beaconsfield Bucks UK
     Michael Chicago Illinois USA</pre>
	 
<p class="blockquote">Expected output:</p>

<p class="blockquote">    addresses, sorted by country and then county</p>

<p class="blockquote">Actual output:</p>

<p class="blockquote">   three blank lines!</p>

<p>Listing 3 contains the code.Can you solve the problem â€“ and then give some further advice?</p>

<table class="sidebartable">
	<tr>
		<td>
			<pre class="programlisting">
#include &lt;algorithm&gt;
#include &lt;iostream&gt;
#include &lt;iterator&gt;
#include &lt;map&gt;
#include &lt;string&gt;
#include &lt;vector&gt;
struct address /* simplified */
{
  std::string city;
  std::string county;
  std::string country;
};
std::istream&amp; operator&gt;&gt;(std::istream&amp; is,
  address&amp; t)
{
  is &gt;&gt; t.city &gt;&gt; t.county &gt;&gt; t.country;
  return is;
}
std::ostream&amp; operator&lt;&lt;(std::ostream&amp; os,
  address&amp; t)
{
  os &lt;&lt; t.city &lt;&lt; ' '&lt;&lt; t.county &lt;&lt; ' '
     &lt;&lt; t.country;
  return os;
}
			</pre>
		</td>
	</tr>
	<tr>
		<td class="title">Listing 3</td>
	</tr>
</table>

<table class="sidebartable">
	<tr>
		<td>
			<pre class="programlisting">
// sort addresses by country, county, and
// then city
bool operator&lt;(address const &amp;a,
               address const &amp; b)
{
  if (a.country &lt; b.country) return true;
  if (a.country == b.country &amp;&amp;
      a.county &lt; b.county) return true;
  if (a.country == b.country &amp;&amp;
      a.county == b.county &amp;&amp;
      a.city &lt; b.city) return true;
  return false;
}

// This is just for testing, real data
// is supplied differently
auto read(std::istream &amp;is)
{
  std::map&lt;std::string, address&gt; ret;
  while (is)
  {
    std::string name;
    address addr;
    if (is &gt;&gt; name &gt;&gt; addr)
    {
      ret[name] = addr;   
    }
  }
  return ret;
}
  
int main()
{
  std::map&lt;std::string, address&gt; map;
  map = read(std::cin);
  
  // Don't copy the addresses
  std::vector&lt;std::tuple&lt;address&amp;&gt;&gt; addrs;
  for (auto entry : map)
  {
    addrs.push_back(entry.second);
  }

  // sort and print the addresses
  std::sort(addrs.begin(), addrs.end());
  for (auto&amp; v : addrs)
  {
    std::cout &lt;&lt; std::get&lt;0&gt;(v) &lt;&lt; '\n';
  }
  std::cout &lt;&lt; '\n';
}
			</pre>
		</td>
	</tr>
	<tr>
		<td class="title">Listing 4</td>
	</tr>
</table>
<p>You can also get the current problem from the accu-general mail list (next entry is posted around the last issueâ€™s deadline) or from the ACCU website (<a href="http://accu.org/index.php/journal">http://accu.org/index.php/journal</a>). This particularly helps overseas members who typically get the magazine much later than members in the UK and Europe.</p>

<p class="bio"><span class="author"><b>Roger Orr</b></span> Roger has been programming for over 20 years, most recently in C++ and Java for various investment banks in Canary Wharf and the City. He joined ACCU in 1999 and the BSI C++ panel in 2002.</p>
</p>
<p><strong>Notes:</strong>&nbsp;</p>
<p><em>More fields may be available via dynamicdata ..</em></p>
</div>
</channel>
</rss>
