Title: An alternative to wchar_t

Author:

Date: 27 March 1995 18:22:18 +01:00

Summary:

Body:

Not all useful integers on a given machine are necessarily representable by C's int type, so there is long int, with a minimum range of a 32-bit one's complement integer. Likewise, not all character sets can be represented by char, so there is a need for a wider character type. What should its name and representation be? There appear to be three possibilities for implementation: typedef an implementation-dependent integral type; define a standard struct, or class in C++, that appropriately describes the wide character; or add a new primitive type to the language.

The question boils down to what criteria should be used in deducing whether something is genuinely a new language type:

portability is often a driving requirement for both new types and new type names;
meeting a previously unfulfillable need is often the indicator for a new type, built-in or otherwise;
a literal constant form seems to indicate a new built-in type;
miscibility with existing primitives indicates a new built-in type in C, but not necessarily in C++;
the need to strongly distinguish between types is an indication of a new type, especially in C++.

Portability defined the need to add the types ptrdiff_t and size_t. The opaque fpos_t type was added to <stdio.h> to allow portable positioning within very large files using the fgetpos and fsetpos functions. Portability was also a reason for adding the third char type, signed char. This move also plugged an obvious type gap in the language. The addition of long double as a type met the demand for higher-precision numerical computation. A primitive bool type has recently been added to C++ to allow differentiation from int for function overloading. It will also reduce the countless roll-your-own Boolean enums, typedefs and classes littering application and library code today. Many have tried, but it is impossible to create a useful Boolean enumeration or class in C++.
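The point about bool and overloading can be sketched in a few lines. This is only an illustration: Describe, DescribeComparison and CheckDescribe are hypothetical names, and each overload simply reports which parameter type overload resolution selected.

```cpp
#include <cassert>
#include <string>

// Hypothetical overload pair: each returns the name of the parameter type
// that overload resolution picked for the argument.
std::string Describe(bool) { return "bool"; }
std::string Describe(int)  { return "int"; }

// The result of a comparison has type bool in C++, so it selects
// Describe(bool) rather than Describe(int).
std::string DescribeComparison(int a, int b) { return Describe(a == b); }

// Small self-check; assert aborts if an expectation does not hold.
void CheckDescribe() {
    assert(Describe(true) == "bool");
    assert(Describe(1) == "int");
    assert(DescribeComparison(2, 2) == "bool");
}
```

If bool were merely a typedef or enum layered over int, the two Describe functions could not be distinguished in this way.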

Before ANSI the need for wide characters was not explicitly catered for in C. Programmers of truly international software were forced to use raw integers for wide characters or a multi-byte representation. Widespread use of the language meant that, with standardisation, internationalisation was a top priority. This has led to the addition of locales as well as basic support for wide and multi-byte characters to represent non-Western character sets. The number and scope of these functions are sure to be extended in the next revision of the ISO C standard; it is a shame that, with the exception of locales, they all ended up in <stdkitchensink.h>.

The ANSI C committee added wchar_t, a synonym for an existing integral type, to <stddef.h> and <stdlib.h>. This makes wide characters easier to use than a struct such as XChar2b, used for representing 16-bit characters in X. The committee also added literal constant forms to the language for wide characters and strings:

wchar_t Char = L'a';
wchar_t String[] = L"a";

One would have thought that any type that had a literal form was obviously primitive: adding a new language type, rather than simply aliasing an integer, would appear to be the correct approach. However, C's already confused notion of char and int sets a precedent:

sizeof('a') == sizeof(int)

The new literal form for wide characters is effectively just another form of integer constant. In C++ the notion of exact type rather than coercible type plays a more fundamental role. Much of this nonsense has been sorted out:

sizeof('a') == sizeof(char)
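The exact types of the two literal forms in C++ can be checked at compile time. The sketch below assumes a compiler supporting static_assert and decltype, facilities from later C++ standards that postdate this article; it is offered as an illustration rather than as anything the committees specified in this form.

```cpp
#include <type_traits>

// In C++ a plain character literal has type char, so its size is 1.
static_assert(sizeof('a') == sizeof(char), "'a' has the size of a char");
static_assert(std::is_same<decltype('a'), char>::value,
              "the exact type of 'a' is char");

// A wide character literal has type wchar_t, which in C++ is a distinct
// built-in type rather than an alias for another integer type.
static_assert(std::is_same<decltype(L'a'), wchar_t>::value,
              "the exact type of L'a' is wchar_t");
static_assert(!std::is_same<wchar_t, int>::value,
              "wchar_t is not a synonym for int");
```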

The joint C++ standardisation committees have also recognised that wide characters deserve a type of their own, adding wchar_t as a new keyword and integral type. To understand this decision consider the problem of overloading output functions:

void Print(char);
void Print(wchar_t);
void Print(int);

This is not portable if wchar_t is a typedef or a macro because it will be a synonym of an integral type. An alias for int will cause a number rather than a character to be printed out. The compiler would also object to a second definition of Print(int), assuming that all the Print functions were defined in the same translation unit; otherwise the ball gets passed to the linker. A first-cut solution is to introduce wchar_t as a standard library class. However, what type does that make literals like L'a'? The only solution in this case is to add a new language type. I would hasten to add that this is not just a solution for hacking C++, but a retrospective correction of what should originally have happened in C.
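With wchar_t as a built-in type, the three overloads can coexist and each kind of argument selects its own function. The sketch below is a variation on the declarations above, made for illustration only: here the Print functions return the name of the selected parameter type instead of printing, and CheckPrintOverloads is a hypothetical helper.

```cpp
#include <cassert>
#include <string>

// Stand-ins for the Print overloads above. Returning a std::string (rather
// than void) is an assumption made so that the chosen overload is visible.
std::string Print(char)    { return "char"; }
std::string Print(wchar_t) { return "wchar_t"; }
std::string Print(int)     { return "int"; }

// Small self-check; assert aborts if overload resolution picks a different
// function from the one expected.
void CheckPrintOverloads() {
    assert(Print('a') == "char");      // plain literal has type char
    assert(Print(L'a') == "wchar_t");  // wide literal has type wchar_t
    assert(Print(97) == "int");        // integer literal has type int
}
```

Were wchar_t only a typedef for int, the second and third definitions would declare the same function and the translation unit would not compile.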

The only thing that remains for me to say against wchar_t is that the name is dreadful. Adding new keywords is always a problem, but wchar_t must count as one of the clumsiest - especially since the _t suffix has traditionally indicated a typedef^[1]. I will, however, grant you that this new keyword is not likely to break many programs. (I am still surprised to see C programmers using class and try as identifiers. Where have they been? More to the point, where are they going?)

Given that long int and long double are the wider versions of int and double, what is wrong with long char? This requires no new keywords and it is also more obviously a character type. Interestingly, the syntax of C and C++ does not exclude this formation. One criterion that Bjarne Stroustrup has used in deciding between features is to gauge how easy it would be to teach and learn them. That long char is unmistakably a character type goes a long way to achieving this.

I do not feel the necessity to further complicate this type with sign, but signedness could obviously follow the char model if required. For compatibility long char must have the same implementation as one of the standard integral types:

sizeof(long char) == sizeof(char) ||
sizeof(long char) == sizeof(short) ||
sizeof(long char) == sizeof(int) ||
sizeof(long char) == sizeof(long)
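Since long char is not accepted by any compiler, the constraint can only be sketched against wchar_t, the type it would be synonymous with under this proposal. The following assumes a compiler with constexpr and static_assert (later additions to C++), and assumes a common platform where wchar_t shares its size with char, short, int or long; SameSizeAsStandardIntegral is a hypothetical name.

```cpp
// wchar_t stands in for the proposed long char (an assumption for
// illustration; long char itself is not valid C or C++).
constexpr bool SameSizeAsStandardIntegral =
    sizeof(wchar_t) == sizeof(char)  ||
    sizeof(wchar_t) == sizeof(short) ||
    sizeof(wchar_t) == sizeof(int)   ||
    sizeof(wchar_t) == sizeof(long);

// Expected to hold on common platforms, where wchar_t is 2 or 4 bytes wide.
static_assert(SameSizeAsStandardIntegral,
              "wchar_t should match the size of a standard integral type");
```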

The retrofit for both C and C++ would be to add the new type to the language and simply mandate that it is the synonym type for wchar_t. The decision to include wchar_t as a built-in type in C++ is not so old and widely implemented that it cannot be reversed in favour of long char. I recognise that the schedule for creating the C++ standard is already pressed and that this suggestion is not simply a global replacement of long char for wchar_t in the forthcoming draft. However, I do not believe it to be complex - unlike run-time type identification, exception handling and namespaces, for instance - and in many senses it is a reduction and not an extension. It is certainly more in the spirit of the language.

With this in mind, I have submitted a proposal to ISO for such a change (for ISOlogists the proposal is numbered WG21/N0507). My thanks to Sean Corfield for his feedback and for agreeing to propose it - read his column, The Casting Vote, in the coming months to find out which way this and a number of other issues go. The feedback has generally been good, but Francis informed me that at a recent ISO C meeting Plauger was less than impressed with the hidden implication that the C standard is anything less than perfect! Oh well, you can't please all of the people...

^[1] This said, Francis pointed out to me that none of the types ending in _t in the standard are required to be typedefs; macros also satisfy the spec.
