Journal Articles

Overload Journal #116 - August 2013 + Programming Topics

Title: Portable String Literals in C++

Author: Martin Moene

Date: Sat, 03 August 2013 18:11:08 +01:00

Summary: How hard can it be to make a file in C++ with international text literals in its name? Alf Steinbach shows us.

Body: 

C++ lacks a built-in or library-provided character encoding value type that reflects the main conventions for the encoding of international text literals, API arguments and, for *nix, external text, namely UTF-8 for *nix [1] and UTF-16 for Windows [2]. As a consequence, standard C++ code that works fine in *nix fails outright or produces erroneous results in Windows, as exemplified below. Portable code deals with this by converting strings at run time (efficiency/complexity cost), and by employing brittle conventions (programmer’s time cost). In teaching the problem is largely just ignored, letting students produce programs that, for example, are unable to deal with their Norwegian names (cost of negative perception of the language – a language so primitive that it can’t even handle text).

The C++11 standard added the literal prefixes u8, u and U that specify known sizes and encodings, respectively UTF-8, UTF-16 and UTF-32. But no matter whether one chooses [3] u8, u or U, the code needs added runtime conversions on one or the other platform. Exacerbating the situation, the C++ standard library supports only char-based narrow strings in filenames and exception messages, which, for example, means that the current Boost filesystem library [4] can’t access many Windows files – the main desktop platform’s files – when it’s used with the g++ compiler.

Happily the limited issue of suitable original string data for portable code, with UTF-8 for *nix and UTF-16 for Windows, can be dealt with ‘simply’ by using macros that adjust the form of literals. Proper core language support would be better still, but a suitable macro + supporting functionality addresses the problem at compile time, most efficiently, with a single common portable notation. And happily, when the macro always produces a Unicode literal then there is no problem with different character sets (only the encoding differs across systems), and when the macro produces a distinctly typed result [5] then there is no problem with inadvertent mixing of incompatible encodings such as Windows ANSI and UTF-8.

Relevant character encodings and terminology

In the mid-1960s, US government computers employed a large number of incompatible character encodings, which reduced interoperability and added needless costs and hassle. The American National Standards Institute, ANSI [6], therefore created a more general single-byte character encoding which became known as ASCII, the American Standard Code for Information Interchange. And on March 11 1968, President Lyndon B. Johnson approved ASCII as a US federal standard.

The ASCII code was English only. So, while ASCII largely solved the Tower of Babel problem within the English-speaking world, the same problem now resurfaced in the rest of the Western world. From this arose a single-byte ASCII extension intended to serve the needs of Western countries, called ISO Latin 1.

The first Windows versions were based on a Microsoft extension of Latin 1 called Windows ANSI. Today that term has taken on a more general meaning (discussed below), and the original Windows ANSI encoding is now known more precisely as Windows ANSI Western, or codepage 1252. A Windows codepage is a number that designates a character encoding in Windows; reportedly it originally referred to a tabular display of a single-byte encoding, literally a ‘code page’, like Figure 1.

CP 1252 (Windows ANSI Western ext. of Latin 1)
     0 1 2 3 4 5 6 7 8 9 A B C D E F

00   - - - - - - - - - - - - - - - -
10   - - - - - - - - - - - - - - - -
20     ! " # $ % & ' ( ) * + , - . /
30   0 1 2 3 4 5 6 7 8 9 : ; < = > ?
40   @ A B C D E F G H I J K L M N O
50   P Q R S T U V W X Y Z [ \ ] ^ _
60   ` a b c d e f g h i j k l m n o
70   p q r s t u v w x y z { | } ~ -
80   € - ‚ ƒ „ … † ‡ ˆ ‰ Š ‹ Œ - Ž -
90   - ‘ ’ “ ” • – — ˜ ™ š › œ - ž Ÿ
A0     ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ ­ ® ¯
B0   ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿
C0   À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï
D0   Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß
E0   à á â ã ä å æ ç è é ê ë ì í î ï
F0   ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ
			
Figure 1

In Figure 1, table rows 00H through 70H constitute original ASCII. Rows 80H through F0H were added in ISO Latin 1, except that in ISO Latin 1 rows 80H and 90H are undefined characters. The characters shown in rows 80H and 90H in Figure 1, including the Euro sign €, are the Windows ANSI Western extension of Latin 1 (in original Windows ANSI there was, of course, no Euro sign, since there was no Euro).

At some point [7] Windows started supporting local variants of Windows ANSI Western, e.g. with Cyrillic or Greek characters. Whatever narrow encoding is used in the GUI, as reported by GetACP(), is known as Windows ANSI, as opposed to the OEM character encoding, which is the locally chosen variant of the original IBM PC encoding, used in text consoles. The different variants of Windows ANSI ensure a global Tower of Babel problem, while the use of two incompatible narrow character encodings on the same machine, namely OEM and Windows ANSI, ensures that there’s also a local Tower of Babel problem – at least for Windows users.

To address the general Tower of Babel problem a number of leading computer industry firms cooperated on developing a ‘universal’ character encoding, an extension of ISO Latin-1 which became known as Unicode. Original Unicode was a fixed size 16-bit per character encoding, and 32-bit Windows NT, introduced in 1993, was based on this 16-bit encoding. However, 16 bits didn’t suffice for e.g. Chinese ideograms, so Unicode was extended to 21 bits per character, and for the existing software the added characters were to be represented as pairs of 16-bit values, called surrogate pairs. Today this encoding is known as UTF-16, and the original 16-bit per character representation is known as UCS-2 (two bytes per character). Windows’ console subsystem API supports copying of rectangular areas of console windows, but only with 16 bits per character, so console windows are effectively limited to UCS-2, while the rest of Windows is now generally UTF-16.
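The surrogate pair scheme can be illustrated with a few lines of code. The following is only a sketch, not code from the article's library: the function name to_utf16 and the self-check are made up for this illustration, and the input is assumed to be a valid code point (at most 0x10FFFF and not itself in the surrogate range).

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, not from the article: encodes one Unicode code point
// (assumed valid, i.e. at most 0x10FFFF and not itself a surrogate) as UTF-16.
inline std::u16string to_utf16( char32_t const code_point )
{
    if( code_point < 0x10000 )
    {
        // Fits in one 16-bit unit; this is all that UCS-2 can express.
        return std::u16string( 1, static_cast<char16_t>( code_point ) );
    }
    char32_t const offset = code_point - 0x10000;   // At most 20 bits remain.
    char16_t const high = static_cast<char16_t>( 0xD800 + (offset >> 10) );
    char16_t const low  = static_cast<char16_t>( 0xDC00 + (offset & 0x3FF) );
    return std::u16string{ high, low };             // The surrogate pair.
}

// Small self-check covering both cases.
inline void check_to_utf16()
{
    assert( to_utf16( 0x03C0 ).size() == 1 );       // Greek small letter pi.
    assert( to_utf16( 0x20000 ).size() == 2 );      // A CJK ideograph outside the 16-bit range.
    assert( to_utf16( 0x20000 )[0] == 0xD840 );
    assert( to_utf16( 0x20000 )[1] == 0xDC00 );
}
```

A UCS-2-only consumer, such as the console window copying mentioned above, sees the two surrogate values as two separate 16-bit items rather than as one character.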

32-bit Windows includes many wrapper functions that automatically convert from legacy code’s Windows ANSI to the basic API’s UTF-16, and back. Typically there is a UTF-16 based function called FooW, and a Windows ANSI wrapper called FooA. This legacy code support extends to the graphical user interface. However, with respect to window messages (small fixed format data packets used to control windows) Microsoft duplicated its file access API blunder, by using configurable encoding expectations. Pointers in window messages are untyped, and when a given message contains a pointer to a string, then that untyped string is encoded as Windows ANSI or UTF-16 depending on the particular window’s configuration… Thus the terms ANSI window and Unicode window. ‘Windows ANSI’ refers to the narrow character encoding used in the graphical user interface and reported by the GetACP API function, while ‘ANSI window’ [8] refers to a window configured to expect and produce Windows ANSI encoded strings in its window messages.

UTF-8, very popular in *nix and for web pages, is an ASCII extension that encodes all of Unicode by using a variable number of bytes per character.
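As a rough illustration of the variable number of bytes, here is a minimal UTF-8 encoder. It is a sketch only: to_utf8 is a name invented for this example, and the code assumes its argument is a valid Unicode code point.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, not from the article: encodes one Unicode code point
// (assumed valid) as a sequence of 1 to 4 UTF-8 bytes.
inline std::string to_utf8( char32_t const cp )
{
    std::string s;
    if( cp < 0x80 )                 // Plain ASCII is encoded as itself.
    {
        s += static_cast<char>( cp );
    }
    else if( cp < 0x800 )           // 2 bytes: 110xxxxx 10xxxxxx
    {
        s += static_cast<char>( 0xC0 | (cp >> 6) );
        s += static_cast<char>( 0x80 | (cp & 0x3F) );
    }
    else if( cp < 0x10000 )         // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    {
        s += static_cast<char>( 0xE0 | (cp >> 12) );
        s += static_cast<char>( 0x80 | ((cp >> 6) & 0x3F) );
        s += static_cast<char>( 0x80 | (cp & 0x3F) );
    }
    else                            // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    {
        s += static_cast<char>( 0xF0 | (cp >> 18) );
        s += static_cast<char>( 0x80 | ((cp >> 12) & 0x3F) );
        s += static_cast<char>( 0x80 | ((cp >> 6) & 0x3F) );
        s += static_cast<char>( 0x80 | (cp & 0x3F) );
    }
    return s;
}

// Small self-check: ASCII stays one byte, Greek pi takes two, the Euro sign three.
inline void check_to_utf8()
{
    assert( to_utf8( 0x41 ) == "A" );
    assert( to_utf8( 0x03C0 ) == "\xCF\x80" );
    assert( to_utf8( 0x20AC ) == "\xE2\x82\xAC" );
}
```

The two bytes CFH 80H for the lowercase Greek π are the ones that matter for the filename example in the next section.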

The inefficiency, complexity and current real world non-portability of standard C++ string literals

Let’s check how some basic, completely standard and therefore presumably automagically [9] portable C++ source code fares in Windows (see Listing 1).

// Source encoding: UTF 8 with BOM (necessary for
// Visual C++).
#include <assert.h>     // assert
#include <fstream>      // std::ofstream
auto main() -> int
{
  auto const filename = "π.recipe";
  // A pie recipe. :-)
  std::ofstream f( filename );
  assert( "File creation" && !!f );
}
			
Listing 1

Compiling with the MinGW g++ 4.7.2 compiler (a Windows build of the GNU toolchain’s C++ compiler), running the program and checking the result gives the output shown in Figure 2.

> del a.exe *.recipe 2>nul &^
More? g++ cplusplus_stdlib_version.cpp &&^
More? a.exe && dir /b *.recipe
Ï€.recipe
			
Figure 2

This produced an erroneous result, a filename different from the specified one, namely Ï€.recipe instead of the specified π.recipe.

In some cases, but mostly with Microsoft’s Visual C++, this happens because a UTF-8-encoded source is misinterpreted as a Windows ANSI-encoded source (so it’s worth checking that the source encoding is correct!), but the reason here is that the MinGW g++ compiler and its standard library implementation have different opinions about what the C++ execution character set is or should be.

The g++ compiler defaults to UTF-8, which is the de facto standard narrow string encoding in *nix, while its standard library implementation, presumably delegating to Microsoft’s runtime library, defaults to Windows ANSI, which is the de facto standard narrow string encoding in Windows programming.
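The Ï€ in Figure 2 follows directly from this mismatch: the UTF-8 bytes CFH 80H for π, read as codepage 1252 (see Figure 1), are the two characters Ï and €. The sketch below reproduces that misreading; the function names are invented for the illustration, and the bytes that codepage 1252 leaves undefined are simply passed through unchanged, which is an assumption of this sketch.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: maps one codepage 1252 byte to a Unicode code point.
// Only the range 80H through 9FH differs from Latin 1. Bytes that are
// undefined in codepage 1252 are passed through unchanged (an assumption).
inline char32_t cp1252_to_unicode( unsigned char const byte )
{
    static char32_t const high[32] =
    {
        0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
        0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
        0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
        0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178
    };
    if( byte >= 0x80 && byte <= 0x9F ) { return high[byte - 0x80]; }
    return byte;
}

// Interprets every byte of a narrow string as one codepage 1252 character,
// which is what happens to a UTF-8 filename handed to a Windows ANSI API.
inline std::u32string misread_as_cp1252( std::string const& bytes )
{
    std::u32string result;
    for( char const ch : bytes )
    {
        result += cp1252_to_unicode( static_cast<unsigned char>( ch ) );
    }
    return result;
}

// Self-check: the UTF-8 bytes of the intended filename turn into the
// filename actually created in Figure 2.
inline void check_misreading()
{
    assert( misread_as_cp1252( "\xCF\x80.recipe" ) == U"\u00CF\u20AC.recipe" );
}
```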

Adjusting the g++ compiler’s execution character set to match its standard library’s expectations will in general not help in obtaining a correct result, since most variants of Windows ANSI lack the lowercase Greek π character. But it does convert the silent erroneous result behaviour to a work-saving up-front compilation error. So, when using g++ in Windows, to avoid possible silent erroneous results do add the -fexec-charset=cpYourANSICodepageNumber option, e.g. as shown in Figure 3.

> del a.exe *.recipe 2>nul &^
More? g++ cplusplus_stdlib_version.cpp -fexec-charset=cp1252 &&^
More? a.exe && dir /b *.recipe
cplusplus_stdlib_version.cpp: In function 'int main()':
cplusplus_stdlib_version.cpp:7:27: error: converting to execution character set: Illegal byte sequence	← Nice up-front compilation error.
cplusplus_stdlib_version.cpp:7:27: error: unable to deduce 'const auto' from '<expression error>'
			
Figure 3

So, how about using Windows’ own main compiler, Microsoft’s Visual C++, for this code? (See Figure 4.)

> del b.exe *.recipe 2>nul &^
More? cl cplusplus_stdlib_version.cpp /Fe"b.exe" &&^
More? b.exe && dir /b *.recipe
cplusplus_stdlib_version.cpp
cplusplus_stdlib_version.cpp(7) : warning C4566: character represented by universal character name '\u03C0' cannot be represented in the current code page (1252)
Assertion failed: "File creation" && !!f, file cplusplus_stdlib_version.cpp, line 10
			
Figure 4

Here Visual C++ unfortunately accepted the source code, but happily the program then produced a runtime error. This is far better than g++’s default silent erroneous result, but it’s rather ungood news for the portability of pure standard C++ source code as of 2013. Currently, the two main free C++ compilers for Windows are Visual C++ (Microsoft) and g++ (GNU), and as exemplified above neither of them supports UTF-8 string constants for e.g. filenames.

It’s not that Windows can’t handle the π.recipe filename. Unicode filenames are supported by the Windows API, they’re supported by Windows-specific library extensions such as _wfopen, and there’s no problem creating or accessing such a file in e.g. Java or C# or Python 3. The problem is that such files can’t be accessed using only portable, pure standard C++ source code, and also that even if C++ had the wide string support that is de facto standard in Windows, using it directly for portable code would be needlessly inefficient in *nix; and the part of that problem that I address here is the support for string literals.

How Boost filesystem doesn’t help

After the C++ standard library the next place to look for general functionality is usually the Boost library. For our example code the relevant sub-library is the Boost filesystem library. The Boost filesystem library, but apparently sans the boost::filesystem::ofstream class [10] that’s used below, is slated for inclusion in C++ Technical Report 2 (TR2) [11], which effectively means also in the next C++ standard.

However, the Boost filesystem library does not offer or visibly use [12] portable system dependent strings, and so for portable code, with the Boost filesystem library a filename such as "π.recipe" has to be specified as a wide string, like L"π.recipe".

Since it’s impractical to deal with two or more different string formats, one would then presumably standardize on using wchar_t based strings for all portable strings. This then incurs a string conversion cost in *nix, in the worst case for most every API call involving strings, which is counter to the general C++ principle of not paying for what you don’t use. This cost (and others) is meant to buy a correct result, so let’s check whether the Boost filesystem library actually does produce a correct result (Listing 2).

// Source encoding: UTF 8 with BOM 
// (necessary for Visual C++).
#include <assert.h>                 // assert
#include <boost/filesystem/fstream.hpp>
 // boost::filesystem::ofstream
namespace bfs = boost::filesystem;
auto main()
  -> int
{
  auto const filename = L"π.recipe"; 
  // A pie recipe. :-)
  bfs::ofstream f( filename );
  assert( "File creation" && !!f );
}
			
Listing 2

Compiling the program with the Visual C++ 11.0 compiler, using boost 1_54 filesystem and system libraries, and running gives the result shown in Figure 5.

> del b.exe *.recipe 2>nul &^
More? cl cplusplus_boost_version.cpp /MD /Fe"b.exe" /I"%boost_pincludes%" /link %msvc_link_bfs% &&^
More? b.exe && dir /b *.recipe
cplusplus_boost_version.cpp
π.recipe
			
Figure 5

Well, that worked nicely! At an encoding conversion cost for *nix, and at the general cost of using Boost. But how about building with MinGW g++ 4.7.2? (See Figure 6.)

> del a.exe *.recipe 2>nul &^
More? g++ cplusplus_boost_version.cpp -fexec-charset=cp1252 %gnuc_using_bfs% &&^
More? a.exe && dir /b *.recipe
Assertion failed: "File creation" && !!f, file cplusplus_boost_version.cpp, line 12

This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.
			
Figure 6

The Boost filesystem library takes advantage of a Visual C++ extension to the standard library, namely a wchar_t based ofstream constructor, when the library is built with Visual C++. The g++ compiler’s standard library implementation has a cleaner extension, a std::streambuf subclass that can be initialized from a C FILE*. And a possibly more efficient workaround for the standard library’s lack of Unicode filename support, which works with any compiler, is Windows’ so-called ‘short’ or ‘DOS’ or ‘8+3’ filenames, which were used in Boost filesystem version 2 [13]. But the current Boost filesystem library simply doesn’t support Windows C++ compilers in general. For Windows it now only provides full functionality, the ability to portably access files with names such as π.recipe, when it’s used with Visual C++ or a compiler with the same standard library extensions…

If the Boost filesystem library is just made part of the C++ standard we’ll then have an absurdity: a part of the standard library making essential use of wide string based constructors, and thus effectively requiring them [14] of all Windows standard library implementations, without having them standardized and available to all.

Summing up, since using Boost filesystem as a portability layer requires using wide strings it incurs an efficiency/complexity cost on *nix, a cost that in a great many cases buys you nothing. And worse, the current version doesn’t even produce correct results with g++ in Windows, thus not providing the goods that the cost was meant to cover. Thus, as of this writing (July 2013) Boost filesystem is not a solution.

Strongly typed system dependent literals

In the same way that C++ integer types such as int are portable because their sizes depend on the system, one can define a character encoding value type [15] that’s portable because its size and assumed encoding depend usefully on the system. I.e., a system dependent character encoding value type, which is portable precisely because it’s system dependent – just as with the int type etc. It can look like Listing 3.

#ifdef _WIN32
   namespace cppx{ typedef wchar_t Raw_syschar; }
   // Implies UTF-16 encoding.
#  define CPPX_WITH_SYSCHAR_PREFIX( lit ) L##lit
#else 
   namespace cppx{ typedef char Raw_syschar; }
   // Implies UTF-8 encoding.
#  define CPPX_WITH_SYSCHAR_PREFIX( lit ) lit
#endif
			
Listing 3

The _WIN32 macro, a de facto standard in Windows C and C++ programming, is defined for both 32-bit and 64-bit Windows programming. There is one problem with the Raw_syschar type, though, namely that it’s just a synonym for another type; it isn’t distinct. For example, one cannot define a distinct std::basic_string specialization for it. It’s practically possible [16] to define a distinct Raw_syschar type as a class, but in order to be able to put that inside a constructor-free union – as can happen with the short string optimization [17], where the union then occurs in the std::basic_string implementation – it would need to be without any user defined constructor. That means that it would need to expose a public data member, which is somewhat unclean, and different from use of basic types like char and wchar_t.

Happily with C++11, and with Visual C++ for some time before that as a language extension, one can define an enum type with a specified underlying representation (this and all the following definitional code is in namespace cppx):

  enum Syschar : Raw_syschar {};

This produces a type with very much the desired properties [18] of a character encoding value type, namely, it’s a distinct type that supports all the built-in comparison operators, and it provides an implicit conversion to integer.
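These properties can be checked at compile time. The following sketch assumes the *nix branch of Listing 3 (char as the raw type) and puts everything in a made-up namespace demo rather than cppx; the helper functions less and as_int are likewise invented for the illustration.

```cpp
#include <cassert>
#include <type_traits>

namespace demo {
    typedef char Raw_syschar;           // Assumption: the *nix branch of Listing 3.
    enum Syschar : Raw_syschar {};

    // A distinct type, with the same size and representation as the raw type.
    static_assert( !std::is_same<Syschar, Raw_syschar>::value, "distinct type" );
    static_assert( sizeof( Syschar ) == sizeof( Raw_syschar ), "same size" );
    static_assert(
        std::is_same<std::underlying_type<Syschar>::type, Raw_syschar>::value,
        "specified underlying type"
        );

    // Implicit conversion to integer, but not from integer.
    static_assert( std::is_convertible<Syschar, int>::value, "to int" );
    static_assert( !std::is_convertible<int, Syschar>::value, "not from int" );

    // The built-in comparison operators work directly on the enum values.
    inline bool less( Syschar const a, Syschar const b ) { return a < b; }

    // The implicit conversion to integer in action.
    inline int as_int( Syschar const c ) { return c; }

    inline void check_syschar()
    {
        assert( less( static_cast<Syschar>( 'a' ), static_cast<Syschar>( 'b' ) ) );
        assert( as_int( static_cast<Syschar>( 'A' ) ) == 'A' );
    }
}  // namespace demo
```

Going from a raw character to a Syschar needs an explicit cast, which is exactly what keeps the two kinds of text from being mixed by accident.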

And just by defining a std::char_traits specialization this type supports a distinct specialization of std::basic_string, if you should want that. Such a std::char_traits specialization is just a collection of static member functions that forward to the corresponding functions for the raw character type. However, such forwarding functions require general conversion between raw and typed characters and character strings - e.g. the following three typed functions for converting to strongly typed form, and corresponding raw functions the other way (Listing 4) [19].

auto typed( Raw_syschar const c )
  CPPX_NOEXCEPT
  -> Syschar
{return static_cast< Syschar const >( c );}
auto typed( Raw_syschar* const s )
  CPPX_NOEXCEPT
  -> Syschar*
{return reinterpret_cast< Syschar*>( s );}
auto typed( Raw_syschar const* const s )
  CPPX_NOEXCEPT
  -> Syschar const*
{return reinterpret_cast< Syschar const* >( s );}
			
Listing 4

This looks trivial, yes?

Unfortunately, in order to later be able to construct a class type string very efficiently from a literal, it’s very desirable to also have a function template like Listing 5, but this function template can then never be implicitly selected. The reason is that for an array type actual argument of any given size the corresponding specialization would not offer a better argument conversion than the pointer argument function. With the specialization the call would therefore be ambiguous. And then the C++11 standard decrees in its §13.3.3/1 fifth dash that F1 is a better function than F2 if “F1 is a non-template function and F2 is a function template specialization”, which for the above functions means that the pointer argument function will always win.

template< Size n >
auto typed( Raw_syschar const (&a)[n] )
  CPPX_NOEXCEPT
  -> Syschar const (&)[n]
{return reinterpret_cast< Syschar 
   const (&)[n] >( a ); }
			
Listing 5
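The overload resolution effect is easy to reproduce in isolation. In this sketch the function which is a stand-in invented for the illustration, using plain char instead of Raw_syschar; it returns 1 from the pointer overload and 2 from the array overload.

```cpp
#include <cassert>
#include <cstddef>

// Non-template overload taking a pointer.
inline int which( char const* ) { return 1; }

// Function template taking a reference to an array of any size.
template< std::size_t n >
int which( char const (&)[n] ) { return 2; }

// A string literal is an array, yet the call below selects the pointer
// overload: neither conversion is better, and the non-template then wins.
inline int call_with_literal() { return which( "abc" ); }

// With an explicit template argument only the template is considered.
inline int call_template_explicitly() { return which<4>( "abc" ); }

inline void check_overloads()
{
    assert( call_with_literal() == 1 );
    assert( call_template_explicitly() == 2 );
}
```

Requiring an explicit template argument at every call site would defeat the purpose, hence the need for some other way to steer the selection.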

My chosen fix is to route all calls to functions in a given set (e.g. typed() calls) via a single function template. The template just checks the actual argument type and dispatches the real work. To enable the dispatch call’s function selection each of the typed overloads, and also each of the raw overloads, is outfitted with a defaulted nameless dummy argument that identifies the general kind of actual argument (see Listing 6).

namespace detail {
  ...
  inline
  auto typed( Raw_syschar const* const& s,
              Pointer_kind = Pointer_kind() )
    CPPX_NOEXCEPT
    -> Syschar const* const&
  { return
    reinterpret_cast< Syschar const* const& >
    ( s ); }
  template< Size n >
  auto typed( Raw_syschar const (&a)[n],
              Array_kind = Array_kind() )
    CPPX_NOEXCEPT
    -> Syschar const (&)[n]
  { return reinterpret_cast
     < Syschar const (&)[n] >( a ); }
}  // namespace detail
			
Listing 6

Listing 7 shows the function template for this set of functions, through which all typed calls go; Type_kind_ is part of the small machinery that checks the argument type (see Listing 8).

template< class Arg >
auto typed( Arg const& arg )
  CPPX_NOEXCEPT
  -> decltype( detail::typed( arg,
    typename Type_kind_<Arg>::T() ) )
{ return detail::typed( arg,
    typename Type_kind_<Arg>::T() ); }
			
Listing 7

#pragma once
// Copyright (c) 2013 Alf P. Steinbach
// Mostly this is to enable a workaround for
// ordinary overload resolution.
#include <rfc/cppx/core/Size.h>  // cppx::Size
namespace cppx {
  enum Value_kind {};
  enum Pointer_kind {};
  enum Array_kind {};
  template< class Type >
  struct Type_kind_ { typedef Value_kind T; };
  template< class Type >
  struct Type_kind_<Type*> {
     typedef Pointer_kind T; };
  template< class Type >
  struct Type_kind_<Type* const> {
     typedef Pointer_kind T; };
  template< class Type, Size n >
  struct Type_kind_< Type[n] > {
     typedef Array_kind T; };
}  // namespace cppx
			
Listing 8
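Put together in one file, the dispatch of Listings 6 through 8 can be sketched as follows. This is a simplified illustration and not the article's actual headers: it assumes the *nix branch of Listing 3, uses std::size_t in place of cppx::Size, leaves out CPPX_NOEXCEPT, passes the pointer by value, drops the default values of the dummy arguments, and lives in a made-up namespace demo.

```cpp
#include <cassert>
#include <cstddef>

namespace demo {
    typedef char Raw_syschar;           // Assumption: the *nix branch of Listing 3.
    enum Syschar : Raw_syschar {};

    // Tag types identifying the general kind of actual argument.
    enum Value_kind {};
    enum Pointer_kind {};
    enum Array_kind {};

    template< class Type >
    struct Type_kind_ { typedef Value_kind T; };
    template< class Type >
    struct Type_kind_<Type*> { typedef Pointer_kind T; };
    template< class Type >
    struct Type_kind_<Type* const> { typedef Pointer_kind T; };
    template< class Type, std::size_t n >
    struct Type_kind_< Type[n] > { typedef Array_kind T; };

    namespace detail {
        // The dummy tag argument makes exactly one overload viable per call.
        inline auto typed( Raw_syschar const* const s, Pointer_kind )
            -> Syschar const*
        { return reinterpret_cast< Syschar const* >( s ); }

        template< std::size_t n >
        auto typed( Raw_syschar const (&a)[n], Array_kind )
            -> Syschar const (&)[n]
        { return reinterpret_cast< Syschar const (&)[n] >( a ); }
    }  // namespace detail

    // The single entry point: checks the argument type and dispatches.
    template< class Arg >
    auto typed( Arg const& arg )
        -> decltype( detail::typed( arg, typename Type_kind_<Arg>::T() ) )
    { return detail::typed( arg, typename Type_kind_<Arg>::T() ); }

    // Self-check helper: the typed pointer refers to the same characters.
    inline bool pointer_round_trips()
    {
        char const* const p = "abc";
        return reinterpret_cast< char const* >( typed( p ) ) == p;
    }
}  // namespace demo
```

With this in place a literal keeps its array type, so that for example sizeof( demo::typed( "abc" ) ) is 4 with char as the raw type, while a pointer argument still yields a typed pointer.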

Listing 9 is the file creation program again, but now using Syschar directly (only the machinery shown so far), producing a correct result. The just-for-this-example ad hoc header x/ofstream.h defines a subclass of std::ofstream called x::ofstream that provides a Syschar-based constructor by employing compiler-specific functionality. The necessity of compiler-specific or at least system-specific code for such basic functionality indicates to me that this area of functionality belongs in the standard.

// Source encoding: UTF 8 with BOM (necessary
// for Visual C++).
#include "x/ofstream.h"     // x::ofstream
#include <assert.h>         // assert

auto main() -> int
{
  using cppx::typed;
  // A pie recipe. :-)
  auto const filename = typed
     ( CPPX_WITH_SYSCHAR_PREFIX( "π.recipe" ) );
  x::ofstream f( filename );
  assert( "File creation" && !!f );
}
			
Listing 9

But as the declaration of filename in Listing 9 shows, direct use of the conversion functionality defined so far yields rather verbose specifications of literal strings…

To support more concise usage expressions I therefore define two further macros, CPPX_U to express a typed literal and CPPX_RAW_U to express an untyped one (Listing 10).

#define CPPX_AS_SYSCHAR( lit ) \
 ::cppx::typed( CPPX_WITH_SYSCHAR_PREFIX( lit ) )

#define CPPX_U      CPPX_AS_SYSCHAR
#define CPPX_RAW_U  CPPX_WITH_SYSCHAR_PREFIX
			
Listing 10

And with CPPX_U the file creation program looks, to my eyes, acceptable (Listing 11).

// Source encoding: UTF 8 with BOM 
// (necessary for Visual C++).
#include "x/ofstream.h"     // x::ofstream
#include <assert.h>         // assert
auto main() -> int
{
  auto const filename = CPPX_U( "π.recipe" );
  // A pie recipe. :-)
  x::ofstream f( filename );
  assert( "File creation" && !!f );
}

			
Listing 11

When it’s compiled for Windows this program uses UTF-16 encoded wchar_t based strings, and when it’s compiled for *nix it uses UTF-8 encoded char based strings. Unlike the C++ standard library and unlike Boost filesystem this ensures maximum efficiency for API calls, i.e. no runtime encoding conversion. And also unlike the C++ standard library and unlike Boost filesystem, with the necessary higher level functional support such as exemplified by x::ofstream, it provides access to all valid filenames on each system, lets students almost effortlessly write portable basic C++ programs that can handle Norwegian student names, etc.

Summary and final considerations

Standard C++11 does not provide the means to access Windows files in general, because the filenames can’t be expressed as Windows ANSI encoded char based strings. The Boost filesystem library, slated for inclusion in TR2, imposes an efficiency cost for portable code used in *nix by requiring portable strings to be wchar_t based. And in Windows the Boost filesystem library only supports general Unicode filenames when it’s used with the Visual C++ compiler.

The main idea for the library solution presented here is to use only the portable CPPX_U string notation in the portable code, and to have such strings reinterpreted as system specific char or wchar_t based strings for the system dependent implementation code, if any, and as necessary. By using a character encoding value type that’s defined differently depending on the system, plus a macro that adds strong typing and an L literal prefix as required for each system, the exact same source code can specify strongly typed string literals with UTF-8 encoding for *nix, and with UTF-16 encoding for Windows. This is maximally efficient for each system’s API function calls and favoured external text encoding, and makes it technically possible to access all valid filenames on each system, as shown.

To make this work most seamlessly the C++ source code should then be UTF-8 encoded with BOM, because that encoding is accepted and understood by default by both Visual C++ [20] and g++, and because support for this source encoding is a reasonable requirement for any C++ compiler that one might consider using.

  1. I haven’t found any authoritative statements or data about *nix character encodings other than Markus Kuhn’s Unix Unicode FAQ maintaining that “UTF-8 is the way in which Unicode is used under Unix, Linux, and similar systems”. In Nov. 2011 I asked about it on Stack Exchange, but alas without a definitive answer. If you’re interested in various opinions and details then check out that question at: http://unix.stackexchange.com/questions/24529/most-common-encoding-for-strings-in-c-in-linux-and-unix.
  2. The main Windows C++ compiler, Visual C++, supports only Windows ANSI as a narrow C++ execution character set, and UTF-16 for wide string literals. Windows ANSI cannot portably encode international text and incurs conversion costs. UTF-16, in Windows called ‘Unicode’, is therefore used by the vast majority of projects, and is the default in Visual Studio projects.
  3. At the time of writing, Visual C++ in version 11.0 does not yet support the C++11 u8, u and U prefixes.
  4. As of Boost version 1.54, released during the writing of this article.
  5. For standard C++ the u8 prefix does produce a char based literal.
  6. At the time known as the United States of America Standards Institute, USASI; the name was changed to the American National Standards Institute, ANSI, in 1969.
  7. According to Wikipedia’s codepage article, at http://en.wikipedia.org/wiki/Code_page, DOS gained codepage support in version 3.3, in 1987, while the first version of Windows was released in 1985.
  8. The term ‘ANSI Windows’ was used by one reviewer, who conflated it with ‘Windows ANSI’ (encoding) and ‘ANSI window’ (configuration). This term can appear to be used when ‘ANSI’ is used as a qualification. E.g. ‘ANSI Windows codepages’, meaning ‘ANSI (Windows codepages)’, the codepages that can be used as Windows ANSI, i.e., that can be returned by GetACP.
  9. ...to compilers that support UTF-8 source code, which all the relevant compilers do. More in general portable for C++ means portable within the limits of the language implementation that one ports to. E.g., putting this to the point, the C++ standard does not specify the size of bool so that frivolous use of bool type local variables conceivably could exceed the available memory, yet such code is portable. One reviewer has however argued that C++ only supports source code with the characters formally guaranteed to be supported, i.e. only pure ASCII source code with no “$” signs, portably.
  10. Judging by the N3693 draft Technical Specification at http://isocpp.org/files/papers/N3693.html
  11. Wikipedia lists the TR2 proposals at http://en.wikipedia.org/wiki/C++_Technical_Report_1#Technical_Report_2
  12. Internally the boost::filesystem::path class uses a representation of international text where the public definition value_type corresponds to the ‘raw’ encoding value type discussed in this article, with UTF-8 for *nix and UTF-16 for Windows. Presumably with C++14 (if that should be the next C++ standard), this article’s Raw_syschar could be defined as std::filesystem::path::value_type.
  13. I filed a ticket about its disappearance in 2011, #6065 available at https://svn.boost.org/trac/boost/ticket/6065
  14. The N3693 draft Technical Specification contains this wording in its §8.4.6: “Implementations of the standard library for systems where string_type is wstring, such as Windows, are encouraged to provide an extension to existing standard library file stream constructors and open functions that adds overloads that accept wstrings for file names. Microsoft and Dinkumware already provide such an extension.”
  15. The wchar_t type can be argued to be such a type, but it’s impractical for the purpose of portability.
  16. While it’s not guaranteed by the C++ standard, as far as I know there’s no compiler that by default will yield sizeof(T) > 1 when T is a POD class with just a single char data member.
  17. The last time I checked, two or three years ago, it did happen with Visual C++’s std::string.
  18. If names of e.g. control characters are desired then one can use an enum class in order to support easy name qualification, but for this article’s exposition enum class would not have a purpose.
  19. Here CPPX_NOEXCEPT is a macro that depending on the compiler is defined as C++11 noexcept (e.g. for g++ and clang) or C++03 throw() (for Visual C++ 11.0 and earlier).
  20. As a practical matter, for UTF-8 encoded source code the Visual C++ compiler requires a Byte Order Mark (BOM) in order to correctly deduce the encoding. Some earlier versions of the g++ compiler didn’t support a BOM for UTF-8, but now it does so that it’s not even necessary to do that minimal source code encoding conversion. The same source can be used exactly as-is for both systems.
