Title: Portable String Literals in C++

Author: Martin Moene

Date: Sat, 03 August 2013 18:11:08 +01:00

Summary: How hard can it be to make a file in C++ with international text literals in its name? Alf Steinbach shows us.

Body:

C++ lacks a built-in or library-provided character encoding value type that reflects the main conventions for the encoding of international text literals, API arguments and, for *nix, external text, namely UTF-8 for *nix¹ and UTF-16 for Windows². As a consequence, standard C++ code that works fine in *nix fails outright or produces erroneous results in Windows, as exemplified below. Portable code deals with this by converting strings at run time (efficiency/complexity cost), and by employing brittle conventions (programmer's time cost), and in teaching the problem is largely just ignored, letting students produce programs that, for example, are unable to deal with their Norwegian names (cost of negative perception of the language – a language so primitive that it can't even handle text).

The C++11 standard added the literal prefixes u8, u and U that specify known sizes and encodings, respectively UTF-8, UTF-16 and UTF-32. But no matter whether one chooses³ u8, u or U, the code needs added runtime conversions on one or the other platform. Exacerbating the situation, the C++ standard library supports only char-based narrow strings in filenames and exception messages, which, for example, means that the current Boost filesystem library⁴ can't access many Windows files – the main desktop platform's files – when it's used with the g++ compiler.
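
As a rough illustration of what the three prefixes produce, the following sketch counts the code units that the same character, π written as the universal character name \u03C0, occupies under each prefix. The names code_units and demo_prefixes are made up for this example. Only sizes are compared, because the element type of a u8 literal is char in C++11 through C++17 but char8_t from C++20 on.

```cpp
#include <cassert>
#include <cstddef>

// Number of code units in a string literal, not counting the terminating zero.
template< class Char, std::size_t n >
constexpr auto code_units( Char const (&)[n] ) -> std::size_t
{ return n - 1; }

// The same character costs a different number of code units per encoding.
static_assert( code_units( u8"\u03C0" ) == 2, "pi is two bytes in UTF-8" );
static_assert( code_units( u"\u03C0" ) == 1, "pi is one 16-bit unit in UTF-16" );
static_assert( code_units( U"\u03C0" ) == 1, "pi is one 32-bit unit in UTF-32" );

inline void demo_prefixes()
{
    // The code unit sizes are fixed by the prefix, not by the platform.
    assert( sizeof( u8"x"[0] ) == 1 );
    assert( sizeof( u"x"[0] ) == sizeof( char16_t ) );
    assert( sizeof( U"x"[0] ) == sizeof( char32_t ) );
}
```

Whichever prefix is chosen, the literal has the same encoding on every platform, which is exactly why one of the platforms ends up needing a runtime conversion.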

Happily the limited issue of suitable original string data for portable code, with UTF-8 for *nix and UTF-16 for Windows, can be dealt with 'simply' by using macros that adjust the form of literals. Proper core language support would be better still, but a suitable macro + supporting functionality addresses the problem at compile time, most efficiently, with a single common portable notation. And happily, when the macro always produces a Unicode literal then there is no problem with different character sets (only the encoding differs across systems), and when the macro produces a distinctly typed result⁵ then there is no problem with inadvertent mixing of incompatible encodings such as Windows ANSI and UTF-8.

Relevant character encodings and terminology

In the mid-1960s, US government computers employed a large number of incompatible character encodings, which reduced interoperability and added needless costs and hassle. The American National Standards Institute, ANSI⁶, therefore created a more general single-byte character encoding which became known as ASCII, the American Standard Code for Information Interchange. And on March 11, 1968, President Lyndon B. Johnson approved ASCII as a US federal standard.

The ASCII code was English only. So, while ASCII largely solved the Tower of Babel problem within the English-speaking world, the same problem now resurfaced in the rest of the Western world. From this arose a single-byte ASCII extension intended to serve the needs of Western countries, called ISO Latin 1.

The first Windows versions were based on a Microsoft extension of Latin 1 called Windows ANSI. Today that term has taken on a more general meaning (discussed below), and the original Windows ANSI encoding is now known more precisely as Windows ANSI Western, or codepage 1252. A Windows codepage is a number that designates a character encoding in Windows; reportedly it originally referred to a tabular display of a single-byte encoding, literally a 'code page', like Figure 1.

CP 1252 (Windows ANSI Western ext. of Latin 1)

     0 1 2 3 4 5 6 7 8 9 A B C D E F

00   - - - - - - - - - - - - - - - -
10   - - - - - - - - - - - - - - - -
20     ! " # $ % & ' ( ) * + , - . /
30   0 1 2 3 4 5 6 7 8 9 : ; < = > ?
40   @ A B C D E F G H I J K L M N O
50   P Q R S T U V W X Y Z [ \ ] ^ _
60   ` a b c d e f g h i j k l m n o
70   p q r s t u v w x y z { | } ~ 
80   € - ‚ ƒ „ … † ‡ ˆ ‰ Š ‹ Œ - Ž -
90   - ‘ ’ “ ” • – — ˜ ™ š › œ - ž Ÿ
A0     ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬   ® ¯
B0   ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿
C0   À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï
D0   Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß
E0   à á â ã ä å æ ç è é ê ë ì í î ï
F0   ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ

Figure 1

In Figure 1, table rows 00H through 70H constitute original ASCII. Rows 80H through F0H were added in ISO Latin 1, except that in ISO Latin 1 rows 80H and 90H are undefined characters. The characters shown in rows 80H and 90H in Figure 1, including the Euro sign €, are the Windows ANSI Western extension of Latin 1 (in original Windows ANSI there was, of course, no Euro sign, since there was no Euro).

At some point⁷ Windows started supporting local variants of Windows ANSI Western, e.g. with Cyrillic or Greek characters. Whatever narrow encoding is used in the GUI, reported by GetACP(), is known as Windows ANSI, as opposed to the OEM character encoding, which is the locally chosen variant of the original IBM PC encoding, used in text consoles. The different variants of Windows ANSI ensure a global Tower of Babel problem, while the use of two incompatible narrow character encodings on the same machine, namely OEM and Windows ANSI, ensures that there's also a local Tower of Babel problem – at least for Windows users.

To address the general Tower of Babel problem a number of leading computer industry firms cooperated on developing a 'universal' character encoding, an extension of ISO Latin-1 which became known as Unicode. Original Unicode was a fixed size 16-bit per character encoding, and 32-bit Windows NT, released in 1993, was based on this 16-bit encoding. However, 16 bits didn't suffice for e.g. Chinese ideograms, so Unicode was extended to 21 bits per character, and for the existing software the added characters were to be represented as pairs of 16-bit values, called surrogate pairs. Today this encoding is known as UTF-16, and the original 16-bit per character representation is known as UCS-2 (two bytes per character). Windows' console subsystem API supports copying of rectangular areas of console windows, but only with 16 bits per character, so console windows are effectively limited to UCS-2, while the rest of Windows is now generally UTF-16.
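
To make the surrogate pair idea concrete, here is a minimal sketch of the standard UTF-16 encoding step for a single code point. The function name utf16_from is invented for this example, and the input is assumed to be a valid Unicode scalar value (no error handling).

```cpp
#include <cassert>
#include <string>

// Encodes one Unicode code point as UTF-16.
// Assumes cp is valid: at most 0x10FFFF and not itself in the surrogate range.
inline auto utf16_from( char32_t const cp ) -> std::u16string
{
    if( cp < 0x10000 ) {
        // Fits in 16 bits: identical to the original UCS-2 representation.
        return std::u16string( 1, static_cast<char16_t>( cp ) );
    }
    char32_t const v = cp - 0x10000;        // 20 significant bits remain
    char16_t const high = static_cast<char16_t>( 0xD800 + (v >> 10) );
    char16_t const low  = static_cast<char16_t>( 0xDC00 + (v & 0x3FF) );
    return std::u16string{ high, low };     // a surrogate pair
}

inline void demo_utf16()
{
    assert( utf16_from( 0x03C0 ) == u"\u03C0" );        // pi: one code unit
    assert( utf16_from( 0x1F600 ) == u"\U0001F600" );   // beyond 16 bits: two
    assert( utf16_from( 0x1F600 ).size() == 2 );
}
```

Code that treats each 16-bit value as one character, as UCS-2 based code does, sees such a pair as two separate characters.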

32-bit Windows includes many wrapper functions that automatically convert from legacy code's Windows ANSI to the basic API's UTF-16, and back. Typically there is a UTF-16 based function called FooW, and a Windows ANSI wrapper called FooA. This legacy code support extends to the graphical user interface. However, with respect to window messages (small fixed format data packets used to control windows) Microsoft duplicated its file access API blunder, by using configurable encoding expectations. Pointers in window messages are untyped, and when a given message contains a pointer to a string, then that untyped string is encoded as Windows ANSI or UTF-16 depending on the particular window's configuration... Thus the terms ANSI window and Unicode window. 'Windows ANSI' refers to the narrow character encoding used in the graphical user interface and reported by the GetACP API function, while 'ANSI window'⁸ refers to a window configured to expect and produce Windows ANSI encoded strings in its window messages.

UTF-8, very popular in *nix and for web pages, is an ASCII extension that encodes all of Unicode by using a variable number of bytes per character.
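
The variable-length scheme can be sketched as follows. This is a minimal illustration of the standard UTF-8 bit layout, not production code: the name utf8_from is invented here and the input is assumed to be a valid code point.

```cpp
#include <cassert>
#include <string>

// Encodes one Unicode code point as UTF-8.
// Assumes cp is a valid scalar value (at most 0x10FFFF, not a surrogate).
inline auto utf8_from( char32_t const cp ) -> std::string
{
    std::string s;
    if( cp < 0x80 ) {                       // ASCII: one byte, unchanged
        s += static_cast<char>( cp );
    } else if( cp < 0x800 ) {               // two bytes: 110xxxxx 10xxxxxx
        s += static_cast<char>( 0xC0 | (cp >> 6) );
        s += static_cast<char>( 0x80 | (cp & 0x3F) );
    } else if( cp < 0x10000 ) {             // three bytes
        s += static_cast<char>( 0xE0 | (cp >> 12) );
        s += static_cast<char>( 0x80 | ((cp >> 6) & 0x3F) );
        s += static_cast<char>( 0x80 | (cp & 0x3F) );
    } else {                                // four bytes
        s += static_cast<char>( 0xF0 | (cp >> 18) );
        s += static_cast<char>( 0x80 | ((cp >> 12) & 0x3F) );
        s += static_cast<char>( 0x80 | ((cp >> 6) & 0x3F) );
        s += static_cast<char>( 0x80 | (cp & 0x3F) );
    }
    return s;
}

inline void demo_utf8()
{
    assert( utf8_from( 'A' ) == "A" );                  // ASCII stays ASCII
    assert( utf8_from( 0x03C0 ) == "\xCF\x80" );        // pi: two bytes
    assert( utf8_from( 0x20AC ) == "\xE2\x82\xAC" );    // Euro sign: three bytes
}
```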

The inefficiency, complexity and current real world non-portability of standard C++ string literals

Let's check how some basic, completely standard and therefore presumably automagically⁹ portable C++ source code fares in Windows (see Listing 1).

// Source encoding: UTF 8 with BOM (necessary for
// Visual C++).
#include <assert.h>     // assert
#include <fstream>      // std::ofstream
auto main() -> int
{
  auto const filename = "π.recipe";
  // A pie recipe. :-)
  std::ofstream f( filename );
  assert( "File creation" && !!f );
}

Listing 1

Compiling with the MinGW g++ 4.7.2 compiler (a Windows build of the GNU toolchain's C++ compiler), running the program and checking the result gives the output shown in Figure 2.

> del a.exe *.recipe 2>nul &^
More? g++ cplusplus_stdlib_version.cpp &&^
More? a.exe && dir /b *.recipe
Ï€.recipe

Figure 2

This produced an erroneous result, a filename different from the specified one, namely Ï€.recipe instead of the specified π.recipe.

In some cases, but mostly with Microsoft's Visual C++, this happens because a UTF-8-encoded source is misinterpreted as a Windows ANSI-encoded source (so it's worth checking that the source encoding is correct!), but the reason here is that the MinGW g++ compiler and its standard library implementation have different opinions about what the C++ execution character set is or should be.

The g++ compiler defaults to UTF-8, which is the de facto standard narrow string encoding in *nix, while its standard library implementation, presumably delegating to Microsoft's runtime library, defaults to Windows ANSI, which is the de facto standard narrow string encoding in Windows programming.
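
The effect of this disagreement can be reproduced on any system with a small sketch that reads UTF-8 bytes as if they were codepage 1252. The function names are invented for this example, and the codepage table is deliberately partial: of the 80H and 90H rows only the Euro sign is mapped, which is all the π example needs.

```cpp
#include <cassert>
#include <string>

// Code point that Windows ANSI Western (codepage 1252) assigns to a byte.
// Partial table for this sketch: in the range 0x80..0x9F only 0x80 (the Euro
// sign) is mapped; the other bytes in that range yield U+FFFD.
inline auto cp1252_codepoint( unsigned char const byte ) -> char32_t
{
    if( byte == 0x80 ) { return 0x20AC; }                   // Euro sign
    if( byte < 0x80 || byte >= 0xA0 ) { return byte; }      // ASCII and Latin 1
    return 0xFFFD;                                          // not covered here
}

// What a reader that assumes codepage 1252 sees in a run of bytes.
inline auto misread_as_cp1252( std::string const& bytes ) -> std::u32string
{
    std::u32string result;
    for( char const ch : bytes ) {
        result += cp1252_codepoint( static_cast<unsigned char>( ch ) );
    }
    return result;
}

inline void demo_misread()
{
    // UTF-8 pi is the two bytes CF 80. Read as codepage 1252 they become
    // U+00CF (I with diaeresis) and U+20AC (Euro sign).
    assert( misread_as_cp1252( "\xCF\x80.recipe" ) == U"\u00CF\u20AC.recipe" );
}
```

The two resulting characters are exactly the Ï€ prefix of the erroneous filename in Figure 2.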

Adjusting the g++ compiler's execution character set to match its standard library's expectations will in general not help in obtaining a correct result, since most variants of Windows ANSI lack the lowercase Greek π character. But it does convert the silent erroneous result behaviour to a work-saving up-front compilation error. So, when using g++ in Windows, to avoid possible silent erroneous results do add the -fexec-charset=cpYourANSICodepageNumber option, e.g. as shown in Figure 3.

> del a.exe *.recipe 2>nul &^
More? g++ cplusplus_stdlib_version.cpp -fexec-charset=cp1252 &&^
More? a.exe && dir /b *.recipe
cplusplus_stdlib_version.cpp: In function 'int main()':
cplusplus_stdlib_version.cpp:7:27: error: converting to execution character set: Illegal byte sequence    ← Nice up-front compilation error.
cplusplus_stdlib_version.cpp:7:27: error: unable to deduce 'const auto' from '<expression error>'

Figure 3

So, how about using Windows' own main compiler, Microsoft's Visual C++, for this code? (See Figure 4.)

> del b.exe *.recipe 2>nul &^
More? cl cplusplus_stdlib_version.cpp /Fe"b.exe" &&^
More? b.exe && dir /b *.recipe
cplusplus_stdlib_version.cpp
cplusplus_stdlib_version.cpp(7) : warning C4566: character represented by universal character name '\u03C0' cannot be represented in the current code page (1252)
Assertion failed: "File creation" && !!f, file cplusplus_stdlib_version.cpp, line 10

Figure 4

Here Visual C++ unfortunately accepted the source code, but happily the program then produced a runtime error. This is far better than g++'s default silent erroneous result, but it's rather ungood news for the portability of pure standard C++ source code as of 2013. Currently, the two main free C++ compilers for Windows are Visual C++ (Microsoft) and g++ (GNU), and as exemplified above neither of them supports UTF-8 string constants for e.g. filenames.

It's not that Windows can't handle the π.recipe filename. Unicode filenames are supported by the Windows API, they're supported by Windows-specific library extensions such as _wfopen, and there's no problem creating or accessing such a file in e.g. Java or C# or Python 3. The problem is that such files can't be accessed using only portable, pure standard C++ source code, and also that even if C++ had the wide string support that is de facto standard in Windows, using it directly for portable code would be needlessly inefficient in *nix; and the part of that problem that I address here is the support for string literals.

How Boost filesystem doesn't help

After the C++ standard library the next place to look for general functionality is usually the Boost library. For our example code the relevant sub-library is the Boost filesystem library. The Boost filesystem library, but apparently sans the boost::filesystem::ofstream class¹⁰ that's used below, is slated for inclusion in C++ Technical Report 2 (TR2)¹¹, which effectively means also in the next C++ standard.

However, the Boost filesystem library does not offer or visibly use¹² portable system dependent strings, and so for portable code, with the Boost filesystem library a filename such as "π.recipe" has to be specified as a wide string, like L"π.recipe".

Since it's impractical to deal with two or more different string formats, one would then presumably standardize on using wchar_t based strings for all portable strings. This then incurs a string conversion cost in *nix, in the worst case for most every API call involving strings, which is counter to the general C++ principle of not paying for what you don't use. This cost (and others) is meant to buy a correct result, so let's check whether the Boost filesystem library actually does produce a correct result (Listing 2).

// Source encoding: UTF 8 with BOM 
// (necessary for Visual C++).
#include <assert.h>                 // assert
#include <boost/filesystem/fstream.hpp>
 // boost::filesystem::ofstream
namespace bfs = boost::filesystem;
auto main()
  -> int
{
  auto const filename = L"π.recipe";
  // A pie recipe. :-)
  bfs::ofstream f( filename );
  assert( "File creation" && !!f );
}

Listing 2

Compiling the program with the Visual C++ 11.0 compiler, using the Boost 1.54 filesystem and system libraries, and running it gives the result shown in Figure 5.

> del b.exe *.recipe 2>nul &^
More? cl cplusplus_boost_version.cpp /MD /Fe"b.exe" /I"%boost_pincludes%" /link %msvc_link_bfs% &&^
More? b.exe && dir /b *.recipe
cplusplus_boost_version.cpp
π.recipe

Figure 5

Well, that worked nicely! At an encoding conversion cost for *nix, and at the general cost of using Boost. But how about building with MinGW g++ 4.7.2? (See Figure 6.)

> del a.exe *.recipe 2>nul &^
More? g++ cplusplus_boost_version.cpp -fexec-charset=cp1252 %gnuc_using_bfs% &&^
More? a.exe && dir /b *.recipe
Assertion failed: "File creation" && !!f, file cplusplus_boost_version.cpp, line 12

This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.

Figure 6

The Boost filesystem library takes advantage of a Visual C++ extension to the standard library, namely a wchar_t based ofstream constructor, when the library is built with Visual C++. The g++ compiler's standard library implementation has a cleaner extension, a std::streambuf subclass that can be initialized from a C FILE*. And a possibly more efficient workaround for the standard library's lack of Unicode filename support, which works with any compiler, is Windows' so called 'short' or 'DOS' or '8+3' filenames, which were used in Boost filesystem version 2.¹³ But the current Boost filesystem library simply doesn't support Windows C++ compilers in general. For Windows it now only provides full functionality, the ability to portably access files with names such as π.recipe, when it's used with Visual C++ or a compiler with the same standard library extensions...

If the Boost filesystem library is just made part of the C++ standard we'll then have an absurdity: a part of the standard library making essential use of wide string based constructors, and thus effectively requiring them¹⁴ of all Windows standard library implementations, without having them standardized and available to all.

Summing up, since using Boost filesystem as a portability layer requires using wide strings it incurs an efficiency/complexity cost on *nix, a cost that in a great many cases buys you nothing. And worse, the current version doesn't even produce correct results with g++ in Windows, thus not providing the goods that the cost was meant to cover. Thus, as of this writing (July 2013) Boost filesystem is not a solution.

Strongly typed system dependent literals

In the same way that C++ integer types such as int are portable because their sizes depend on the system, one can define a character encoding value type¹⁵ that's portable because its size and assumed encoding depend usefully on the system. I.e., a system dependent character encoding value type, which is portable precisely because it's system dependent – just as with the int type etc. It can look like Listing 3.

#ifdef _WIN32
   namespace cppx{ typedef wchar_t Raw_syschar; }
   // Implies UTF-16 encoding.
#  define CPPX_WITH_SYSCHAR_PREFIX( lit ) L##lit
#else 
   namespace cppx{ typedef char Raw_syschar; }
   // Implies UTF-8 encoding.
#  define CPPX_WITH_SYSCHAR_PREFIX( lit ) lit
#endif

Listing 3
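
The effect of the two branches can be checked at compile time. The sketch below repeats the pattern of Listing 3 under invented names (namespace sketch, SKETCH_RAW_U, unit_size), so that it stands alone, and verifies that a literal written once gets the code unit type chosen for the system.

```cpp
#include <cassert>
#include <cstddef>

#ifdef _WIN32
    namespace sketch { typedef wchar_t Raw_syschar; }   // implies UTF-16
#   define SKETCH_RAW_U( lit ) L##lit
#else
    namespace sketch { typedef char Raw_syschar; }      // implies UTF-8
#   define SKETCH_RAW_U( lit ) lit
#endif

namespace sketch {
    // Size in bytes of one code unit of a string literal.
    template< class Char, std::size_t n >
    constexpr auto unit_size( Char const (&)[n] ) -> std::size_t
    { return sizeof( Char ); }
}  // namespace sketch

// One notation in the source, one system dependent representation.
static_assert(
    sketch::unit_size( SKETCH_RAW_U( "recipe" ) ) == sizeof( sketch::Raw_syschar ),
    "the literal's code unit matches the system's raw character type"
    );

inline void demo_raw_literal()
{
    sketch::Raw_syschar const* const s = SKETCH_RAW_U( "recipe" );
    assert( s[0] == SKETCH_RAW_U( 'r' ) );      // the macro also prefixes character literals
}
```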

The _WIN32 macro, a de facto standard in Windows C and C++ programming, is defined for both 32-bit and 64-bit Windows programming. There is one problem with the Raw_syschar type, though, namely that it's just a synonym for another type, so it isn't distinct. For example, one cannot define a distinct std::basic_string specialization for it. It's practically possible¹⁶ to define a distinct Raw_syschar type as a class, but in order to be able to put that inside a constructor-free union – as can happen with the short string optimization¹⁷, where the union then occurs in the std::basic_string implementation – it would need to be without any user defined constructor. That means that it would need to expose a public data member, which is somewhat unclean, and different from use of basic types like char and wchar_t.

Happily with C++11, and with Visual C++ for some time before that as a language extension, one can define an enum type with a specified underlying representation (this and all the following definitional code is in namespace cppx):

  enum Syschar : Raw_syschar {};

This produces a type with very much the desired properties¹⁸ of a character encoding value type, namely, it's a distinct type that supports all the built-in comparison operators, and it provides an implicit conversion to integer.
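
These properties can be checked directly. The following sketch assumes the *nix branch (Raw_syschar as char) to stay self-contained, and uses an invented namespace sketch and an invented helper is_typed; it is an illustration of the language rules, not part of the library.

```cpp
#include <cassert>
#include <type_traits>

namespace sketch {
    typedef char Raw_syschar;               // assumption: the *nix branch
    enum Syschar : Raw_syschar {};

    static_assert( !std::is_same< Syschar, Raw_syschar >::value,
        "Syschar is a distinct type" );
    static_assert( sizeof( Syschar ) == sizeof( Raw_syschar ),
        "with the same size as the raw type" );
    static_assert(
        std::is_same< std::underlying_type< Syschar >::type, Raw_syschar >::value,
        "and the raw type as its underlying representation" );

    // Because the types are distinct, they can select different overloads.
    inline auto is_typed( Raw_syschar ) -> bool { return false; }
    inline auto is_typed( Syschar ) -> bool     { return true; }
}  // namespace sketch

inline void demo_syschar()
{
    using sketch::Syschar;
    Syschar const a = static_cast< Syschar >( 'a' );
    Syschar const b = static_cast< Syschar >( 'b' );

    assert( a < b && a != b );      // built-in comparison operators
    int const code = a;             // implicit conversion to integer
    assert( code == 'a' );
    assert( sketch::is_typed( a ) && !sketch::is_typed( 'a' ) );
}
```

The conversion in the other direction, from a raw character to Syschar, requires an explicit cast, which is what prevents inadvertent mixing.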

And just by defining a std::char_traits specialization this type supports a distinct specialization of std::basic_string, if you should want that. Such a std::char_traits specialization is just a collection of static member functions that forward to the corresponding functions for the raw character type. However, such forwarding functions require general conversion between raw and typed characters and character strings - e.g. the following three typed functions for converting to strongly typed form, and corresponding raw functions the other way (Listing 4).¹⁹
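
For orientation, a std::char_traits specialization of this kind might look roughly like the sketch below. It is not the library's implementation: it assumes the *nix branch (Raw_syschar as char), lives in an invented namespace sketch, and implements the array operations as plain element-wise loops instead of forwarding through reinterpreted pointers. The helper typed_copy is likewise made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

namespace sketch {
    typedef char Raw_syschar;               // assumption: the *nix branch
    enum Syschar : Raw_syschar {};
}

namespace std {
    template<>
    struct char_traits< sketch::Syschar >
    {
        typedef sketch::Syschar                     char_type;
        typedef char_traits< sketch::Raw_syschar >  Raw;    // helper, not a standard member
        typedef Raw::int_type                       int_type;
        typedef Raw::off_type                       off_type;
        typedef Raw::pos_type                       pos_type;
        typedef Raw::state_type                     state_type;

        static sketch::Raw_syschar raw( char_type c ) noexcept
        { return static_cast< sketch::Raw_syschar >( c ); }

        static void assign( char_type& a, char_type const& b ) noexcept { a = b; }
        static bool eq( char_type a, char_type b ) noexcept { return Raw::eq( raw( a ), raw( b ) ); }
        static bool lt( char_type a, char_type b ) noexcept { return Raw::lt( raw( a ), raw( b ) ); }

        static int compare( char_type const* a, char_type const* b, size_t n )
        {
            for( size_t i = 0; i < n; ++i ) {
                if( lt( a[i], b[i] ) ) { return -1; }
                if( lt( b[i], a[i] ) ) { return +1; }
            }
            return 0;
        }
        static size_t length( char_type const* s )
        { size_t n = 0;  while( !eq( s[n], char_type() ) ) { ++n; }  return n; }
        static char_type const* find( char_type const* s, size_t n, char_type const& c )
        {
            for( size_t i = 0; i < n; ++i ) { if( eq( s[i], c ) ) { return s + i; } }
            return nullptr;
        }
        static char_type* move( char_type* dst, char_type const* src, size_t n )
        {   // The ranges may overlap, so copy in the safe direction.
            if( dst < src ) { for( size_t i = 0; i < n; ++i ) { dst[i] = src[i]; } }
            else            { for( size_t i = n; i > 0; --i ) { dst[i - 1] = src[i - 1]; } }
            return dst;
        }
        static char_type* copy( char_type* dst, char_type const* src, size_t n )
        { for( size_t i = 0; i < n; ++i ) { dst[i] = src[i]; }  return dst; }
        static char_type* assign( char_type* s, size_t n, char_type c )
        { for( size_t i = 0; i < n; ++i ) { s[i] = c; }  return s; }

        static int_type to_int_type( char_type c ) noexcept { return Raw::to_int_type( raw( c ) ); }
        static char_type to_char_type( int_type i ) noexcept
        { return static_cast< char_type >( Raw::to_char_type( i ) ); }
        static bool eq_int_type( int_type a, int_type b ) noexcept { return Raw::eq_int_type( a, b ); }
        static int_type eof() noexcept { return Raw::eof(); }
        static int_type not_eof( int_type i ) noexcept { return Raw::not_eof( i ); }
    };
}  // namespace std

namespace sketch {
    typedef std::basic_string< Syschar > Syschar_string;    // a distinct string type

    // Character by character conversion, just to have something to test with.
    inline auto typed_copy( std::basic_string< Raw_syschar > const& s ) -> Syschar_string
    {
        Syschar_string result;
        for( Raw_syschar const c : s ) { result += static_cast< Syschar >( c ); }
        return result;
    }
}  // namespace sketch
```

With this in place, sketch::typed_copy( "abc" ) yields a three-character Syschar_string that cannot be passed where a std::string is expected.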

auto typed( Raw_syschar const c )
  CPPX_NOEXCEPT
  -> Syschar
{return static_cast< Syschar const >( c );}
auto typed( Raw_syschar* const s )
  CPPX_NOEXCEPT
  -> Syschar*
{return reinterpret_cast< Syschar*>( s );}
auto typed( Raw_syschar const* const s )
  CPPX_NOEXCEPT
  -> Syschar const*
{return reinterpret_cast< Syschar const* >( s );}

Listing 4

This looks trivial, yes?

Unfortunately, in order to later be able to construct a class type string very efficiently from a literal, it's very desirable to also have a function template like Listing 5, but this function template can then never be implicitly selected. The reason is that for an array type actual argument of any given size the corresponding specialization would not offer a better argument conversion than the pointer argument function. On argument conversions alone the call would therefore be ambiguous. The tie is broken by the C++11 standard, which decrees in its §13.3.3/1 fifth dash that F1 is a better function than F2 if "F1 is a non-template function and F2 is a function template specialization", which for the above functions means that the pointer argument function will always win.

template< Size n >
auto typed( Raw_syschar const (&a)[n] )
  CPPX_NOEXCEPT
  -> Syschar const (&)[n]
{return reinterpret_cast< Syschar 
   const (&)[n] >( a ); }

Listing 5
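
The tie-break is easy to observe in isolation. In this sketch the names Chosen and pick are invented for the demonstration; the overloads have the same shapes as the pointer and array versions of typed.

```cpp
#include <cassert>
#include <cstddef>

enum Chosen { pointer_overload, array_overload };

// Non-template overload taking a pointer.
inline auto pick( char const* ) -> Chosen { return pointer_overload; }

// Function template taking a reference to an array of any size.
template< std::size_t n >
auto pick( char const (&)[n] ) -> Chosen { return array_overload; }

inline void demo_overloads()
{
    // Both conversions rank as exact matches, so the non-template wins the tie.
    assert( pick( "abc" ) == pointer_overload );

    // The template is only reachable when it is named explicitly.
    assert( pick< 4 >( "abc" ) == array_overload );
}
```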

My chosen fix is to route all calls to functions in a given set (e.g. typed() calls) via a single function template. The template just checks the actual argument type and dispatches the real work. To enable the dispatch call's function selection each of the typed overloads, and also each of the raw overloads, is outfitted with a defaulted nameless dummy argument that identifies the general kind of actual argument (see Listing 6).

namespace detail {
  ...
  inline
  auto typed( Raw_syschar const* const& s,
              Pointer_kind = Pointer_kind() )
    CPPX_NOEXCEPT
    -> Syschar const* const&
  { return
    reinterpret_cast< Syschar const* const& >
    ( s ); }
  template< Size n >
  auto typed( Raw_syschar const (&a)[n],
              Array_kind = Array_kind() )
    CPPX_NOEXCEPT
    -> Syschar const (&)[n]
  { return reinterpret_cast
     < Syschar const (&)[n] >( a ); }
}  // namespace detail

Listing 6

Listing 7 shows the function template for this set of functions, through which all typed calls go; Type_kind_ is part of the small machinery that checks the argument type (see Listing 8).

template< class Arg >
auto typed( Arg const& arg )
  CPPX_NOEXCEPT
  -> decltype( detail::typed( arg,
    typename Type_kind_<Arg>::T() ) )
{ return detail::typed( arg,
    typename Type_kind_<Arg>::T() ); }

Listing 7

#pragma once
// Copyright (c) 2013 Alf P. Steinbach
// Mostly this is to enable a workaround for
// ordinary overload resolution.
#include <rfc/cppx/core/Size.h>  // cppx::Size
namespace cppx {
  enum Value_kind {};
  enum Pointer_kind {};
  enum Array_kind {};
  template< class Type >
  struct Type_kind_ { typedef Value_kind T; };
  template< class Type >
  struct Type_kind_<Type*> {
     typedef Pointer_kind T; };
  template< class Type >
  struct Type_kind_<Type* const> {
     typedef Pointer_kind T; };
  template< class Type, Size n >
  struct Type_kind_< Type[n] > {
     typedef Array_kind T; };
}  // namespace cppx

Listing 8
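
To see the dispatch at work without the rest of the library, here is a reduced, self-contained sketch of the same technique applied to a made-up function, length, instead of typed. Everything is placed in an invented namespace sketch, and std::size_t stands in for cppx::Size.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

namespace sketch {
    enum Value_kind {};
    enum Pointer_kind {};
    enum Array_kind {};

    template< class Type >
    struct Type_kind_ { typedef Value_kind T; };
    template< class Type >
    struct Type_kind_< Type* > { typedef Pointer_kind T; };
    template< class Type, std::size_t n >
    struct Type_kind_< Type[n] > { typedef Array_kind T; };

    namespace detail {
        // Pointer argument: the length must be computed at run time.
        inline auto length( char const* const& s, Pointer_kind ) -> std::size_t
        { return std::strlen( s ); }

        // Array argument: the length is known from the type.
        template< std::size_t n >
        auto length( char const (&)[n], Array_kind ) -> std::size_t
        { return n - 1; }
    }  // namespace detail

    // The single entry point: the kind tag, not ordinary overload resolution
    // on the first argument, decides which implementation is called.
    template< class Arg >
    auto length( Arg const& arg ) -> std::size_t
    { return detail::length( arg, typename Type_kind_< Arg >::T() ); }
}  // namespace sketch

inline void demo_dispatch()
{
    char const* const p = "a\0b";
    assert( sketch::length( p ) == 1 );         // pointer version: stops at the zero
    assert( sketch::length( "a\0b" ) == 3 );    // array version: whole literal
}
```

The differing results for the same characters show that the array overload really is selected for a literal, which plain overloading could not achieve.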

Listing 9 is the file creation program again, but now using Syschar directly (only the machinery shown so far), producing a correct result. The just-for-this-example ad hoc header x/ofstream.h defines a subclass of std::ofstream called x::ofstream that provides a Syschar-based constructor by employing compiler-specific functionality. The necessity of compiler-specific or at least system-specific code for such basic functionality indicates to me that this area of functionality belongs in the standard.

// Source encoding: UTF 8 with BOM (necessary
// for Visual C++).
#include "x/ofstream.h"     // x::ofstream
#include <assert.h>         // assert

auto main() -> int
{
  using cppx::typed;
  // A pie recipe. :-)
  auto const filename = typed
     ( CPPX_WITH_SYSCHAR_PREFIX( "π.recipe" ) );
  x::ofstream f( filename );
  assert( "File creation" && !!f );
}

Listing 9

But as the declaration of filename in Listing 9 shows, direct use of the conversion functionality defined so far yields rather verbose specifications of literal strings...

To support more concise usage expressions I therefore define two further macros, CPPX_U to express a typed literal and CPPX_RAW_U to express an untyped one (Listing 10).

#define CPPX_AS_SYSCHAR( lit ) \
 ::cppx::typed( CPPX_WITH_SYSCHAR_PREFIX( lit ) )

#define CPPX_U      CPPX_AS_SYSCHAR
#define CPPX_RAW_U  CPPX_WITH_SYSCHAR_PREFIX

Listing 10

And with CPPX_U the file creation program looks, to my eyes, acceptable (Listing 11).

// Source encoding: UTF 8 with BOM 
// (necessary for Visual C++).
#include "x/ofstream.h"     // x::ofstream
#include <assert.h>         // assert
auto main() -> int
{
  auto const filename = CPPX_U( "π.recipe" );
  // A pie recipe. :-)
  x::ofstream f( filename );
  assert( "File creation" && !!f );
}

Listing 11

When it's compiled for Windows this program uses UTF-16 encoded wchar_t based strings, and when it's compiled for *nix it uses UTF-8 encoded char based strings. Unlike the C++ standard library and unlike Boost filesystem this ensures maximum efficiency for API calls, i.e. no runtime encoding conversion. And also unlike the C++ standard library and unlike Boost filesystem, with the necessary higher level functional support such as exemplified by x::ofstream, it provides access to all valid filenames on each system, and lets students almost effortlessly write portable basic C++ programs that can handle Norwegian student names, etc.

Summary and final considerations

Standard C++11 does not provide the means to access Windows files in general, because the filenames can't be expressed as Windows ANSI encoded char based strings. The Boost filesystem library, slated for inclusion in TR2, imposes an efficiency cost for portable code used in *nix by requiring portable strings to be wchar_t based. And in Windows the Boost filesystem library only supports general Unicode filenames when it's used with the Visual C++ compiler.

The main idea for the library solution presented here is to use only the portable CPPX_U string notation in the portable code, and to have such strings reinterpreted as system specific char or wchar_t based strings for the system dependent implementation code, if any, and as necessary. By using a character encoding value type that's defined differently depending on the system, plus a macro that adds strong typing and an L literal prefix as required for each system, the exact same source code can specify strongly typed string literals with UTF-8 encoding for *nix, and with UTF-16 encoding for Windows. This is maximally efficient for each system's API function calls and favoured external text encoding, and makes it technically possible to access all valid filenames on each system, as shown.

To make this work most seamlessly the C++ source code should then be UTF-8 encoded with BOM, because that encoding is accepted and understood by default by both Visual C++²⁰ and g++, and because support for this source encoding is a reasonable requirement for any C++ compiler that one might consider using.

Notes:

¹ I haven't found any authoritative statements or data about *nix character encodings other than Markus Kuhn's Unix Unicode FAQ maintaining that "UTF-8 is the way in which Unicode is used under Unix, Linux, and similar systems". In Nov. 2011 I asked about it on Stack Exchange, but alas without a definitive answer. If you're interested in various opinions and details then check out that question at: http://unix.stackexchange.com/questions/24529/most-common-encoding-for-strings-in-c-in-linux-and-unix.
² The main Windows C++ compiler, Visual C++, supports only Windows ANSI as a narrow C++ execution character set, and UTF-16 for wide string literals. Windows ANSI cannot portably encode international text and incurs conversion costs. UTF-16, in Windows called 'Unicode', is therefore used by the vast majority of projects, and is the default in Visual Studio projects.
³ At the time of writing, Visual C++ in version 11.0 does not yet support the C++11 u8, u and U prefixes.
⁴ As of Boost version 1.54, released during the writing of this article.
⁵ For standard C++ the u8 prefix does produce a char based literal.
⁶ At the time known as the United States of America Standards Institute, USASI; the name was changed to the American National Standards Institute, ANSI, in 1969.
⁷ According to Wikipedia's codepage article, at http://en.wikipedia.org/wiki/Code_page, DOS gained codepage support in version 3.3, in 1987, while the first version of Windows was released in 1985.
⁸ The term 'ANSI Windows' was used by one reviewer, who conflated it with 'Windows ANSI' (encoding) and 'ANSI window' (configuration). This term can appear to be used when 'ANSI' is used as a qualification. E.g. 'ANSI Windows codepages', meaning 'ANSI (Windows codepages)', the codepages that can be used as Windows ANSI, i.e., that can be returned by GetACP.
⁹ ...to compilers that support UTF-8 source code, which all the relevant compilers do. More generally, portable for C++ means portable within the limits of the language implementation that one ports to. E.g., putting this to the point, the C++ standard does not specify the size of bool, so that frivolous use of bool type local variables conceivably could exceed the available memory, yet such code is portable. One reviewer has however argued that C++ only supports source code with the characters formally guaranteed to be supported, i.e. only pure ASCII source code with no "$" signs, portably.
¹⁰ Judging by the N3693 draft Technical Specification at http://isocpp.org/files/papers/N3693.html
¹¹ Wikipedia lists the TR2 proposals at http://en.wikipedia.org/wiki/C++_Technical_Report_1#Technical_Report_2
¹² Internally the boost::filesystem::path class uses a representation of international text where the public definition value_type corresponds to the 'raw' encoding value type discussed in this article, with UTF-8 for *nix and UTF-16 for Windows. Presumably with C++14 (if that should be the next C++ standard), this article's Raw_syschar could be defined as std::filesystem::path::value_type.
¹³ I filed a ticket about its disappearance in 2011, #6065 available at https://svn.boost.org/trac/boost/ticket/6065
¹⁴ The N3693 draft Technical Specification contains this wording in its §8.4.6: "Implementations of the standard library for systems where string_type is wstring, such as Windows, are encouraged to provide an extension to existing standard library file stream constructors and open functions that adds overloads that accept wstrings for file names. Microsoft and Dinkumware already provide such an extension."
¹⁵ The wchar_t type can be argued to be such a type, but it's impractical for the purpose of portability.
¹⁶ While it's not guaranteed by the C++ standard, as far as I know there's no compiler that by default will yield sizeof(T) > 1 when T is a POD class with just a single char data member.
¹⁷ The last time I checked, two or three years ago, it did happen with Visual C++'s std::string.
¹⁸ If names of e.g. control characters are desired then one can use an enum class in order to support easy name qualification, but for this article's exposition enum class would not have a purpose.
¹⁹ Here CPPX_NOEXCEPT is a macro that depending on the compiler is defined as C++11 noexcept (e.g. for g++ and clang) or C++03 throw() (for Visual C++ 11.0 and earlier).
²⁰ As a practical matter, for UTF-8 encoded source code the Visual C++ compiler requires a Byte Order Mark (BOM) in order to correctly deduce the encoding. Some earlier versions of the g++ compiler didn't support a BOM for UTF-8, but now it does, so that it's not even necessary to do that minimal source code encoding conversion. The same source can be used exactly as-is for both systems.