Overload Journal #139 - June 2017 + Journal Editorial

Title: I am not a number

Author: Martin Moene

Date: Wed, 07 June 2017 13:25:43 +01:00

Summary: When is a number not a number? Frances Buontempo counts the ways this happens.

Body: 

Distracted recently by my sister’s photographs of her trip to Portmeirion [Portmeirion], I am reminded of the phrase from the television programme The Prisoner, which was filmed there: “I am not a number. I am a free man.” In numerical computing we often see data that claims it is not a number, perhaps leaking NaNs to your front end in the process, or littering your log files. I once noticed that the Transport for London website was claiming it took NaN minutes to get from one stop to another. These things should be caught and hidden from your front end. Log files can fill up with NaNs too if you are not careful. They tend to propagate through calculations, like a virus changing all your numbers into complaints. People make mistakes when confronted with these curious beasts, a common one being to test whether a value is a NaN by comparing it against NaN:

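  #include <cmath>     // NAN
  #include <iostream>

  float do_some_maths()
  {
    return NAN;
  }
  int main() {
    float quiet_nan = NAN;
    float answer = do_some_maths();
    // Mistake: a NaN compares unequal to everything, itself included,
    // so this test is always true and the message is always printed.
    if (quiet_nan != answer)
      std::cout << "Is a number\n";
  }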

As I hope all our readers know, since a NaN is not a number it does not equal any number; furthermore, it does not equal itself. We should in fact compare answer != answer or, better yet, use standard functions like isnan, double.IsNaN, math.isnan or similar, depending on your language.
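As a minimal sketch of that advice in C++, the helper below uses std::isnan from <cmath>. The names is_a_number and nan_comparison_demo are made up for illustration, and do_some_maths is the same stand-in as in the listing above.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Stand-in for a calculation that went wrong; it always yields a NaN.
float do_some_maths()
{
    return std::numeric_limits<float>::quiet_NaN();
}

// Hypothetical helper: true for any value that is not a NaN
// (infinities count as numbers here).
bool is_a_number(float value)
{
    return !std::isnan(value);
}

// Self-checking demonstration of why comparing against NaN cannot work.
void nan_comparison_demo()
{
    const float quiet_nan = std::numeric_limits<float>::quiet_NaN();
    const float answer = do_some_maths();

    assert(quiet_nan != answer);   // true even though both are NaNs
    assert(answer != answer);      // a NaN does not equal itself
    assert(!is_a_number(answer));  // std::isnan reports it correctly
    assert(is_a_number(1.5f));
}
```

Calling nan_comparison_demo() from main runs the checks. Note that aggressive floating-point options such as -ffast-math may assume NaNs never occur, in which case none of these tests can be trusted.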

There are many different types of NaN, at least in IEEE 754 [IEEE754], including signalling and quiet versions, and a sign, positive or negative, which may or may not mean something. I believe JavaScript just has one NaN, but the function isNaN can be applied to things like "123ABC", whereupon it will tell you they are not a number, and yet it claims the empty string is a number. The Mozilla Developer Network [MDN] goes into glorious detail. A ‘more robust’ function, Number.isNaN(), exists, which indicates whether the value is a NaN and its type is numeric; in other words, it’s a number that is not a number. No wonder non-technical people have such a hard time understanding what we are talking about.
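The point about a NaN carrying a sign can be poked at from C++. This is only a sketch: it assumes the platform provides quiet NaNs, the names has_sign_bit and nan_flavours_demo are invented for the example, and signalling NaNs are left out because their behaviour depends on the platform and compiler settings.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical helper: reports the sign bit, which std::isnan ignores.
bool has_sign_bit(double value)
{
    return std::signbit(value);
}

void nan_flavours_demo()
{
    using limits = std::numeric_limits<double>;
    static_assert(limits::has_quiet_NaN, "this sketch assumes quiet NaNs exist");

    const double quiet = limits::quiet_NaN();
    const double negative = std::copysign(quiet, -1.0);  // NaN with the sign bit set
    const double positive = std::copysign(quiet, 1.0);   // NaN with the sign bit clear

    assert(std::isnan(negative) && std::isnan(positive));
    assert(has_sign_bit(negative));
    assert(!has_sign_bit(positive));
    assert(negative != positive);  // still unequal, like any pair of NaNs
}
```

Both values are NaNs as far as std::isnan is concerned; only std::signbit tells them apart.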

If you read a file, say a CSV with 0s or 1s in the columns, and add it using logstash to an index in elasticsearch, you might be surprised if you try to do an aggregation such as sum on a term or field, and are told the data needs to be numeric. How is a 0 or 1 not numeric? What is the world coming to if 0 or 1 is not a number? Mathematics is impossible, at least on a computer, if this is really the case. It turned out, as I expect you have already guessed, that you have to tell logstash to mutate a field if you want it to treat input as numeric rather than as a string. True story, though if I’d read the manual more carefully it would have been apparent in advance. Many newbie mistakes stem from typing a number and ending up with a string. This is unintuitive. When I type 10 I expect it to be 10, not 2 let alone "10".

The majority of small children have an idea what a number is, but if you start trying to discuss strings with most adults, you are in danger of talking at cross purposes. And even after an attempt at disambiguation, the description of a small rope might not help. In one case the rope is cordage for tying things together, larger than a string in circumference, and smaller than a cable, while in the other case a rope data structure is a tree (ok, that will send us further down the rabbit hole) made of smaller strings. I am not aware of a cable data structure. Even a clear idea of when a variable needs to be a number, and an awareness of the different numeric data types in your chosen language, still leaves space for confusion and mistakes. To the uninitiated,

  int x = 1,000,000;

looks perfectly reasonable, and yet a C++ compiler might complain that it expected an identifier and report syntax error: "constant". Say, what?! Ah, a comma is a non-numeric character, so it cannot be used to initialise a number without first parsing it; in a declaration the comma separates one declarator from the next, which is why the compiler goes looking for an identifier. Yet we are used to writing numbers with separators to make them easier for us to parse. Of course, the specific character used depends on the locale, which in turn can cause problems if we write something to a file a human wants to look at and then read it back in with a computer. I recently discovered that different locales tend to use different characters as separators in, erm, ‘comma’ separated value files. You could attempt to take advantage of user defined literals if you wished to express troublesome numbers in your code, or indeed use the C++14 digit separator, to say int x = 1'000'000; instead. Crowl et al. provide further details in N3781 [Crawl13]. Java programmers will be laughing at this point, since they use _ instead. Well, from 7 onwards. Commas make things hard to parse, it seems.
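A small sketch of both routes, assuming a C++14 compiler: the digit separator for numbers written in code, and a deliberately naive helper for numbers that arrive as text. The helper parse_grouped_int is hypothetical; it simply drops one grouping character before parsing, and does not consult the locale or check that the groups are well formed.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// C++14 digit separators are ignored by the compiler: this is 1000000.
constexpr int one_million = 1'000'000;

// Hypothetical helper: remove the grouping character, then parse what is left.
// std::stoi throws std::invalid_argument if no digits remain.
int parse_grouped_int(std::string text, char separator = ',')
{
    text.erase(std::remove(text.begin(), text.end(), separator), text.end());
    return std::stoi(text);
}

void digit_separator_demo()
{
    assert(one_million == 1000000);
    assert(parse_grouped_int("1,000,000") == one_million);
    assert(parse_grouped_int("1.000.000", '.') == one_million);  // other locales, other separators
}
```

Passing the separator explicitly keeps the sketch short; real input would need the locale's conventions, including its decimal point, taken into account.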

Numbers and strings differ, though it is possible to express numbers as strings, and to express some strings as numbers. We should steer clear of different types of strings and numbers, otherwise we will be here forever. If you do wish to find the string equivalent of a number, we have already strayed into different locales, though only as far as digit separators. Python draws a distinction between the repr() and str() functions. The former, giving a representation of the object, can be useful for the debugger, and could be passed to the evaluate function eval to rebuild the object. It may also contain the object’s address, thereby ensuring uniqueness between different objects. In contrast, the str() function is designed to be slightly more friendly for humans, maybe adding some formatting to make it easier to read, or dropping extraneous information such as an object’s address. Just to keep you on your toes, if you have a container of objects on which you call str(), be aware that the result will use repr() on the contained objects.

If we return to C++, or C even, one thing which catches people out is pointers. Given int * y suitably initialised, printing y will (probably) yield something very different to printing the contents of y. Many beginners end up printing the address held in a pointer rather than the value to which it points. Why have I got this number rather than that number? I have not got the right number. Programmers more used to C# or Java may also tend to new up objects, assign them to raw pointers, and then print the address by mistake.

Representations, locales or idioms, if you will, change as the geography changes. Many young people complain in distress when they progress from arithmetic in mathematics lessons to algebra, suggesting maths has no right to end up being about letters. As you gain more experience, you realise this abstraction allows you to build up general rules and discover more patterns. Furthermore, it allows programmers to write a function which takes a number, using a signature like int forward(int x); The function takes a number, which we will refer to by a letter. Madness. We should clearly use a string like step instead, since single letter parameters or variable names can be a little too terse.
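The pointer mix-up described above is easy to reproduce. The sketch below streams a pointer and the value it points to into strings so the two can be compared; to_text and pointer_printing_demo are names invented for this example.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Formats whatever is streamed, so the two cases can be compared as text.
template <typename T>
std::string to_text(const T& value)
{
    std::ostringstream out;
    out << value;
    return out.str();
}

void pointer_printing_demo()
{
    int number = 42;
    int* y = &number;

    const std::string address = to_text(y);    // the address; its format is implementation-defined
    const std::string contents = to_text(*y);  // the value pointed to

    assert(contents == "42");
    assert(address != contents);  // "this number rather than that number"
}
```

Forgetting the * is all it takes to get the first string when you wanted the second.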

Let us consider integers and some basic maths. What happens when you start with int x = 0; and then add one, over and over? If you do this 100 times, what happens? What about a million? As we know, this rather depends; more specifically, the point at which the overflow happens depends on the compiler. Is it 16-bit? 32-bit? 64-bit? Something else? Those who are paying attention will realise that had we started with an unsigned integer we would have been on safer ground, since these wrap round when they overflow, but signed integers overflowing is undefined behaviour. We can never get to +Inf by adding one over and over again. Adding one to any (unsigned) number always yields a number, but possibly a smaller one than you started with. Robert Ramey introduced a safe numerics library in a previous Overload [Ramey] to catch this and related issues. What happens if we initialise our int x with 1 and keep halving? How low can we go? Ah. Perhaps we need to make it a double or float instead. So many numbers, and so many types of numbers. What happens if we try these experiments in different languages? Can you count up to a googol in your chosen language? (Or perhaps multiply up to, since it is rather large.) The general point is that the numbers you can express out of the box vary between languages. If you need larger or more precise numbers, you need to find a library, or roll your own representation.
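These experiments can be sketched in C++14 without triggering undefined behaviour. All the function names below are invented for illustration; the signed case is only checked, never overflowed, and the double version assumes a finite starting value.

```cpp
#include <cassert>
#include <limits>

// Unsigned arithmetic is modular, so this is well defined for every input.
constexpr unsigned add_one(unsigned x)
{
    return x + 1u;
}

// Signed overflow is undefined behaviour, so ask before incrementing.
constexpr bool can_add_one(int x)
{
    return x < std::numeric_limits<int>::max();
}

// Integer division truncates towards zero, so 1 / 2 is already 0.
constexpr int halvings_until_zero(int x)
{
    int steps = 0;
    while (x != 0) {
        x /= 2;
        ++steps;
    }
    return steps;
}

// A double keeps halving through ever smaller values before it underflows to zero.
int halvings_until_zero(double x)
{
    int steps = 0;
    while (x != 0.0) {
        x /= 2.0;
        ++steps;
    }
    return steps;
}

void counting_demo()
{
    assert(add_one(std::numeric_limits<unsigned>::max()) == 0u);  // wraps round
    assert(!can_add_one(std::numeric_limits<int>::max()));
    assert(halvings_until_zero(1) == 1);
    assert(halvings_until_zero(1.0) > halvings_until_zero(1));
}
```

How many halvings the double survives depends on the floating-point format in use, which is rather the point.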

As I am sure I have observed before, in most languages a "literal" constant, such as 5, is put inline in the code. FORTRAN puts 5 in memory, which means you can change its value; for details of how, see [Gorgonzola]. This is, perhaps, an extreme case of everything being an object, or at least a reference. Other languages claim everything is an object. Such languages often have toString() and hashCode() functions for every object, the former of which returns a string, and the latter a number. Presumably these are both also objects, or perhaps immutable value types, or primitive data types. A talk about equality in various languages at the 2011 ACCU conference observed that Java may end up with 1,000 != 1,000, though equals returns true [Orr, Love]. Programming languages are odd. Programmers can be odd too, but we are all humans. We are not numbers, or resources, whatever the project plan or HR department says.

Those who use a flavour of agile may be familiar with story "points". These appear to be numeric, at least in the sense that they provide some form of order: this is bigger, or smaller, than that. They tend to follow a Fibonacci type sequence, rather than a linear progression, to avoid a fuss over whether something is an 11 day or 12 day job. If you can choose from 1, 2, 3, 5, 8, 13 then that attempt at "precision guessing" is circumvented. Once it’s got that big you could argue it’s beyond hope and needs breaking down, before the team has a breakdown. A different set of cards by Lunar Logic is available, consisting of just 1, TFB and NFC, meaning 1, too flipping big, and no flipping chance [Lunar Logic], which many readers will have come across many times before. The inexperienced will often try to convert the story points directly into days or hours, or arrange your story post-it notes into a Gantt chart. They will learn, eventually.

The idea of numbers or even symbols being used to order something rather than provide a metric is important. A topology gives you relative positions; think of the London underground map, showing you which stops appear in which order along a line. In contrast, a metric gives you a "distance" (or cost or time or, erm, metric); think of the Paris metro map which shows the distance between the stations. Topologies and metrics can both use numbers, but they mean something very different. Numbers can give us an ordering, a count, a way to compare. They can also give us a description, from how many, to intensity or mass or other ideas with units; scalars versus vectors if you will.

Numbers can be useful. They can be misused. They can be represented in various ways. I wonder if you can use user defined literals to cope with Roman numerals? I wish I hadn’t thought of that! People, on the other hand, are neither resources nor numbers. They can be represented by numbers, say in a race, or numbers and letters, for a user id, but that is for simplicity rather than a deeper truth. I am not a number, and I still haven’t written an editorial.
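For anyone unable to resist the Roman numeral question, here is one possible sketch using a C++14 user defined literal. The _roman suffix and the roman_digit helper are inventions for this example; nothing is validated, so malformed numerals such as "IIII" or "VX" are happily added up, and unknown characters count as zero.

```cpp
#include <cstddef>

// Value of a single Roman digit; anything unrecognised counts as zero.
constexpr int roman_digit(char c)
{
    return c == 'I' ? 1
         : c == 'V' ? 5
         : c == 'X' ? 10
         : c == 'L' ? 50
         : c == 'C' ? 100
         : c == 'D' ? 500
         : c == 'M' ? 1000
         : 0;
}

// Hypothetical literal suffix, used as "XIV"_roman.
// A digit smaller than its right-hand neighbour is subtracted, otherwise added.
constexpr int operator""_roman(const char* text, std::size_t length)
{
    int total = 0;
    for (std::size_t i = 0; i < length; ++i) {
        const int value = roman_digit(text[i]);
        const int next = (i + 1 < length) ? roman_digit(text[i + 1]) : 0;
        total += (value < next) ? -value : value;
    }
    return total;
}
```

With this in scope, int year = "MMXVII"_roman; compiles and can even be evaluated at compile time. The literal has to be a string, since a bare XIV would be parsed as an identifier rather than a number.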

References

[Crawl13] ‘Single quotation mark as a digit separator’ Crowl, Smith, Snyder, Vandevoorde 2013. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3781.pdf

[Gorgonzola] https://everything2.com/title/Changing+the+value+of+5+in+FORTRAN

[IEEE754] http://grouper.ieee.org/groups/754/

[Lunar Logic] https://estimation.lunarlogic.io/

[MDN] https://developer.mozilla.org/en/docs/Web/JavaScript/Reference/Global_Objects/isNaN

[Orr, Love] https://accu.org/content/conf2011/Steve-Love-Roger-Orr-equals.pdf with more details in the Overload write up https://accu.org/index.php/journals/1971

[Portmeirion] http://www.portmeirion-village.com/visit/the-prisoner/

[Ramey] ‘Correct Integer Operations with Minimal Runtime Penalties’ Overload 137, Feb 2017 https://accu.org/index.php/journals/2344
