Journal Articles
Browse in : |
All
> Journals
> CVu
> 274
(13)
All > Topics > Programming (877) Any of these categories - All of these categories |
Note: when you create a new publication type, the articles module will automatically use the templates user-display-[publicationtype].xt and user-summary-[publicationtype].xt. If those templates do not exist when you try to preview or display a new article, you'll get this warning :-) Please place your own templates in themes/yourtheme/modules/articles . The templates will get the extension .xt there.
Title: Code Critique Competition 95
Author: Martin Moene
Date: 04 September 2015 07:20:30 +01:00 or Fri, 04 September 2015 07:20:30 +01:00
Summary: Set and collated by Roger Orr. A book prize is awarded for the best entry.
Body:
Participation in this competition is open to all members, whether novice or expert. Readers are also encouraged to comment on published entries, and to supply their own possible code samples for the competition (in any common programming language) to scc@accu.org.
Note: If you would rather not have your critique visible online, please inform me. (We will remove email addresses!)
Last issue’s code
I had some code that used an anonymous structure within an anonymous union to easily refer to the high and low 32-bit parts of a 64-bit pointer. However, I get a warning that this is non-portable (I'm not quite sure why - MSVC and g++ both accept it) but after googling around for a solution I found one that uses #define. It all compiles without warnings now so I think it's fixed.
Can you give some advice to help this programmer?
The code is in Listing 1.
--- address.h --- #pragma once typedef struct { union { struct { int32_t offsetLo; int32_t offsetHi; } s; void *pointer; }; } address; // simulate anonymous structs with #define #define offsetLo s.offsetLo #define offsetHi s.offsetHi --- test program --- #include <cstdint> #include <iostream> #include "address.h" int main() { int var = 12; address a; a.pointer = &var; std::cout << "Address = " << std::hex << a.offsetHi << "/" << a.offsetLo << std::endl; } |
Listing 1 |
Critiques
James Holland
The student says that warnings are issued but not when using MSVS or G++. It would appear that the student’s compiler is trying to warn of something that the other compilers are keeping quiet about. It is a pity that the student does not tell us what the warnings are. In their absence, it is worth thinking about what the warnings could be. This may provide some clues as to why the code may not be portable.
It may be the case that the compiler is generating pointers of length other than 64 bits. Without instructing the compiler otherwise, it might well produce 32-bit pointers. This would have the effect that when the address of var
is assigned to a.pointer
, offsetHi
would not get assigned a value. This is because a.pointer is only of sufficient size to cover offsetLo
. The result is that offsetHi
would be left with an undefined value. So, if the compiler is generating 32-bit pointers, it would be appropriate for a warning to be issued to the effect that offsetHi
is never initialised. We need to find a way to guard against this scenario.
Probably the most direct way is to check at compile-time that the pointer is 64 bits in length. This can be achieved using the following static_assert
statement.
static_assert(sizeof address::pointer == 8, "Pointers are not 64-bit!");
If the compiler is not generating 64-bit pointers, the error message will be issued when the program is compiled thus preventing executable code from being produced.
What else could go wrong? Well, for the output of the program to be interpreted correctly, it is essential that offsetLo
is located on the least significant half of a.pointer and that offsetHi
is located on the most significant half. There are two ways in which machines store multi-byte values; least significant bytes first followed by the more significant bytes or most significant bytes first followed by the lesser significant bytes. The former method is known as little endian, the latter is known as big endian. As an example, the table below shows the two types of endian for the value 0x12345678.
Endian | Big | Little | |||||||
---|---|---|---|---|---|---|---|---|---|
Byte value | 12 | 34 | 56 | 78 | 78 | 56 | 34 | 12 | |
Byte offset | 0 | 1 | 2 | 3 | 0 | 1 | 2 | 3 |
From the supplied code, it can be seen that the student is expecting the software to be run on a little endian machine as offsetLo
has been declared before offsetHo
and so offsetLo
will be located at the lower memory address. The following code could be used to determine the endian of a particular machine.
a.address = reinterpret_cast<int *>(1); if (a.offsetLo == 1) { // Little endian ... } else { // Big endian ... }
I leave it to the student as to how the code is used in a practical program.
It is conceivable that the compiler will place some padding bytes between offsetLo
and offsetHi
. This will depend on the settings of the compiler and on machine for which the software is designed to run. For the program to work correctly there must be no padding. It is possible to test for the presence of padding by use of the following code.
const auto x = reinterpret_cast<size_t>(&a.offsetLo); const auto y = sizeof address::offsetLo; const auto z = reinterpret_cast<size_t>(&a.offsetHi); if (x + y != z) { std::cout << "offsetLo and offsetHi not contiguous!" << std::endl; }
All of these problems occur because the program is relying on language features that are implementation dependant. It would be nice if the compiler would give warnings if such features are used but the standard does not mandate it and not many compilers do. If implementation dependant features must be used, it is up to the programmer to make checks to ensure the program is valid.
The student implies that when the inner structure is given a name and the #defines
that simulate anonymous structures are used, the warnings disappear. This is very strange. #defines
are little more than a text substitution mechanism and so I would not have thought they would have any effect on warning message issued. Also, I would not have thought giving the inner structure a name would have any affect either, so this remains a mystery. Irrespective of whether a warning is issued, any underlying problem would remain the same. There is still implementation dependant behaviour that must be considered.
As the student simply wishes to refer to the high and low order bytes of a 64-bit address, the supplied program could be simplified somewhat. The outer struct
could be removed and the inner struct
returned to being anonymous. There would then be no need for the #defines
.
From these observations it can be seen that the student’s code is not without its problems. Perhaps there is another method that can achieve the same result in a simpler way. It is always worth reflecting on the code one writes to see if there is a more elegant solution. I now describe one such method that is worth considering.
The method consists of two stages. Firstly, the high order bytes of the variables address are set to zero and the result stored for later display. Secondly, the high order bytes are shifted into the position occupied by the low order bytes. The result is also stored for later display. From a practical point of view, there is one slight complication. The address of the variable var
has to be cast to an integer type before any zeroing or shifting can take place. This is no great hardship and has the advantage of making it clear to anyone reading the code that the address of var
is being handled in somewhat unconventional way. Code using this method is shown below.
const auto address = reinterpret_cast<int64_t>(&var); const auto offsetLo = address & 0xffffffff; const auto offsetHi = address >> 32; std::cout << "Address = " << std::hex << offsetHi << "/" <<offsetLo << std::endl;
To conclude, it is preferable to write simple code that does not rely on features of the implementation. If this is not possible, include checks to ensure such features are consistent with the operation of the program. Ideally, such checks should take place at compile-time. If this is inconvenient or impossible, run-time checks should be performed.
Commentary
This problem revolves around trying to treat a pointer value as an integral value. This is inherently non-portable as C++ makes few guarantees about the sizes of fundamental types and the relationship between pointer values and integral ones. While this enables C++ to be efficient on a wide range of platforms with different address ranges and register sizes (since the compiler can use a ‘natural’ word size) it does make it harder to write code targeting specific platforms.
The types provided by the C header <stdint.h> (and available to C++ using the <cstdint>
header) provide a way to access various sized integer types if such types are available on the target implementation. If the int
N_t
type is available it ‘designates a signed integer type with width N, no padding bits, and a two’s complement representation’.
I do not think however that a signed type is a good choice for breaking an address into two parts – it would be better to use the unsigned type uint32_t
. The use of signed integers to represent address values is a common source of errors for 32 bit programs, typically when addresses are compared. It seems to be less of a problem for 64 bit programs since the top bit of the 64bit address is clear (at least for user-mode programs) on both Windows and Linux, so treating the address as a signed value does not change the meaning of any pointer comparisons. If provided, the type uintptr_t
is a typedef for an integral type large enough to round-trip any pointer to void. (There is a signed equivalent too, but I personally don’t recommend using that for the reasons above.)
The original code that used an anonymous structure and an anonymous union gets a warning when used in C++ programs since although anonymous structs are allowed by the ISO C standard (since C11) they are not part of ISO C++. (MSVC gives a warning in C mode that a non-standard extension is being used – but this refers to the superseded C standard.) I apologise for the lack of clarity in the code critique about the exact form of the code using the anonymous struct.
The technique of breaking a pointer into two parts by writing into one element of a union and reading back from another one normally works in practice, but care must be taken to avoid subtle bugs.
As James points out, the student’s example assumes a little-endian machine so that the layout of the 32 bit integers results in the low 32 bits of the pointer value overlaying offsetLo
and the high 32 bits overlaying offsetHi
.
It would be well worth while adding some checks (some are possible at compile time using static_assert
) to both verify that these assumptions hold now and to avoid unpleasant surprises when the code is recompiled with a different compiler or for a different target platform. It is unfortunately all too easy when writing these checks to accidentally invoke undefined behaviour.
The use of a #define
to simulate the anonymous struct is worrying mostly because of the normal problems of preprocessor definitions – they are a textual substitution done without full integration with the syntax of the language.
In this example, suppose another piece of code uses the symbol offsetLo
and includes the address.hheader file?
Adding the following variable declaration to main
:
int offsetLo = 0;
caused an error message, from one compiler, stating: "'->': trailing return type is not allowed after a non-function declarator"
. It is not entirely clear to me why! gcc was more helpful and provided a note referring to the macro definition of
offsetLo
, but this ‘poisoning’ of the identifiers offsetLo
and offsetHi
is still a nuisance. One possible approach might be to provide inline member functions offsetLo()
and offsetHi()
and encapsulate the union ‘hack’ inside the implementation of the ‘address’ structure.
One last minor usability problem with the header file is that it uses data types defined in <cstdint>
but relies on this header being already included. This makes code using the header fragile as an unrelated removal of a header file can cause compilation errors in this one.
The Winner of CC 94
There was only one entrant this time, but James covered most of the problems with the code and provided a helpful diagram of the difference between little and big endian layout so I think he deserves the prize despite the lack of competition! Perhaps the next critique will attract a little more interest?
Code Critique 95
(Submissions to scc@accu.org by Oct 1st)
I have some C code that tries to build up a character array using printf
calls but I’m not getting the output I expect. I’ve extracted a simpler program from my real program to show the problem.
With one compiler I get: "Rog"
and with another I get "lburp@"
.
I’m expecting to see:
"Roger: 10 Bill: 5 Wilbur: 12"
What have I done wrong?
The code is in Listing 2.
#include <stdio.h> #define ARRAY_SZ(x) sizeof(x)/sizeof(x[0]) typedef struct _Score { char *name; int score; } Score; void to_string(Score *scores, size_t n, char *buffer, size_t len) { for (size_t i = 0; i < n; i++) { size_t printed = snprintf(buffer, len, "%s:\t%u\n", scores[i].name, scores[i].score); buffer += printed; len -= printed; } } void process(char buffer[]) { Score sc[] = { { "Roger", 10 }, { "Bill", 5 }, { "Wilbur", 12 }, }; to_string(sc, ARRAY_SZ(sc), buffer, ARRAY_SZ(buffer)); } int main() { char buffer[100]; process(buffer); printf(buffer); } |
Listing 2 |
You can also get the current problem from the accu-general mail list (next entry is posted around the last issue’s deadline) or from the ACCU website (http://www.accu.org/journals/). This particularly helps overseas members who typically get the magazine much later than members in the UK and Europe.
Notes:
More fields may be available via dynamicdata ..