Journal Articles

CVu Journal Vol 31, #3 - July 2019 + Student Code Critiques from CVu journal.


Title: Beyond Code Criticism

Author: Bob Schmidt

Date: Thu, 04 July 2019 00:53:28 +01:00

Summary: Daniel James considers issues that might lie behind code like that in Code Critique 117.

Body: 

The Code Critique Competition (CCC) is a regular feature of CVu, and has been running for years. In each issue of CVu there’s a new snippet of code that doesn’t work quite as it should – something a beginner might write – and readers are asked to correct the code, and to comment on any other way in which it might be improved. The focus is on the code, the bugs it contains, and the overall quality of coding in the snippet. We’re not asked to look beyond the code provided.

CCC 117, whose solutions appear above, caught my eye in particular for the function named remove_outliers. This function is documented with the single comment – the only comment in the whole code sample – “get rid of the biggest and the smallest”, with no further explanation.

That rang a mental alarm bell for me straight away, because ‘outlier’ is a term of art in statistics. It doesn’t just refer to the data at the ends of the range – ‘the biggest and the smallest’ mentioned in the comment – but to any datum so different from the rest of the data set as to be incongruous, and so possibly erroneous. I knew this from school maths, more years ago than I care to admit, and I had a vague recollection that the term applied to data points that differed from the mean by more than three standard deviations. I see now from Wikipedia [1] that the picture is rather more complicated, but it is clear, at least, that a given data set may or may not contain values that can be considered outliers, and that if it does they will not simply be the highest and lowest values in the set.
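By way of illustration – this is my own sketch, not anything from CCC 117, and find_outliers is a name I’ve invented for the purpose – the three-sigma rule of thumb I half-remembered looks something like this in C++:

```cpp
#include <cmath>
#include <vector>

// Sketch of the 'three-sigma' rule of thumb: flag values that lie more
// than three standard deviations from the sample mean. Real statistical
// practice is subtler -- see the Wikipedia article [1].
std::vector<double> find_outliers(const std::vector<double>& data)
{
    if (data.size() < 2) return {};

    double mean = 0.0;
    for (double x : data) mean += x;
    mean /= data.size();

    double var = 0.0;
    for (double x : data) var += (x - mean) * (x - mean);
    const double sd = std::sqrt(var / (data.size() - 1));  // sample s.d.

    std::vector<double> outliers;
    for (double x : data)
        if (std::abs(x - mean) > 3.0 * sd)
            outliers.push_back(x);
    return outliers;
}
```

Note that for a small, evenly spread set such as {1, 2, 3, 4, 5} this rule finds nothing at all – which is exactly the point: the highest and lowest values of a data set are not automatically outliers.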

I also recall from my time as a research scientist that not every surprising measurement is an error. A data value that lies outside the expected range cannot just be ignored – the value should be checked, the experiment repeated, the method of measurement refined, until we can be sure that the value is correct. If it still doesn’t fit the theory perhaps the theory is wrong? That is how scientific advances occur – not by ignoring inconvenient data but by revising theory to fit them.

As guest editor I’ve already had the chance to have a look at the solutions to CCC 117 sent in by readers. All have taken the sample code and its comment at face value, and corrected the code to do what the comment said – which is probably the right response in the context of a competition about code quality. However, as Roger Orr writes in his summary: “… this is a good case where I think in a code review you'd ask for a comment explaining the purpose of the function and, hopefully, providing a link to some broader explanation of the algorithm...”.

Luckily, the code in CCC 117 was made up as a challenge for programmers: it’s not supposed to solve any real-world problem, and it doesn’t actually matter if it discards perfectly reasonable data. It will never have been through a code review because it isn’t production code – its sole purpose is to be reviewed by the readers of CVu for the CCC – and that review really is only about code quality.

What, though, if the code snippet were real code that was supposed to do something meaningful? In that case we’d hope that the programmer would already have asked a lot of searching questions and discovered what was actually required, and would have written code that explained itself better and did the right thing. That’s easier said than done, though – the programmer often doesn’t have access to the people who understand the requirement, and can only ask questions of managers who don’t themselves fully understand the problem. The answer, all too often, is “Don’t overthink it. Just write the code.” This is a serious failure in communication, and a sure-fire way to achieve code that doesn’t do what’s wanted!

Then again, what if the snippet were from a legacy codebase that has been found to produce incorrect outputs? There might be nobody around to ask what it should be doing. We’d then want to look at the specification, and maybe the original requirement. We’d want to know a lot more about the problem that the code was supposed to solve. Without that knowledge we couldn’t begin to say what might possibly be wrong with the code. With luck, everything would be clear from the documentation – if it existed, and if we could get access to it.

It could be something innocuous – perhaps the programmer has used the term ‘outlier’ without knowing its significance in statistics, and there is a perfectly good reason to discard the highest and lowest values. A lot has been written about self-documenting code, and how the choice of good names for the parts of a program can make the code easier to read. This is the opposite: a situation in which a poorly chosen name – chosen poorly because the programmer was unaware of the specific meaning of ‘outlier’ – leads to confusing code. Communication is hard – especially when you don’t know the jargon being used. If this is the case here, the error is simply one of naming – remove_extremes would be a better name – and we can turn our attention back to improving the implementation. It would still be worthwhile to add a comment about why the extreme values are to be discarded, with a reference to the specification and/or the requirement.
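A hedged sketch of what such a renamed function might look like – the signature and the choice of std::vector are my assumptions, not anything taken from the competition code:

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Hypothetical replacement for the confusingly named remove_outliers:
// discard exactly one smallest and one largest value, assuming the
// specification really does call for that.
void remove_extremes(std::vector<double>& v)
{
    if (v.size() < 3) return;  // too few values to discard anything sensibly

    auto [lo, hi] = std::minmax_element(v.begin(), v.end());
    if (lo > hi) std::swap(lo, hi);  // order by position, not by value
    // Erase the later element first so the earlier iterator stays valid.
    v.erase(hi);
    v.erase(lo);
}
```

Even here the comment explaining *why* the extremes go – and a pointer to the requirement that says so – would matter more than the code itself.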

It could be that remove_outliers was intended to remove actual outliers in one of the statistical senses, but that what we see is just a placeholder for actual code that should have replaced it later in development. We’ve all mocked out a function during development, but how many of us have forgotten to remove the mock from production code? It shouldn’t be possible as it should get caught in testing … but this code clearly wasn’t written for testing! If the code was intended as a placeholder, the programmer should certainly have communicated that intention by commenting it as such.

The explanation may, of course, be that remove_outliers just doesn’t do the right thing, and maybe doesn’t do anything sensible at all. The requirement wasn’t communicated effectively to the original programmer, or the programmer didn’t correctly translate that requirement into code – we can’t tell which it is without going back to the requirement. To fix the code we need to fix the failure in communication.

The problem is ultimately a communication problem. We have a function name that suggests a meaning that may not have been intended and a comment that says what the code does rather than why, combined with code that is suspect. We’re left scratching our heads and wondering what it all means. Of course, in the CCC that’s the whole point – to make us think – but production code should tell the whole story and leave no room for guesswork.

Reference

[1] ‘Outlier’ – https://en.wikipedia.org/wiki/Outlier

Daniel James left postgraduate chemical research to work in software development. He has worked on small and large systems, on small and smaller computers, and has come to specialize in IT security. His first interest is in making the computer do something useful.
