Journal: Overload, o116
Title: Portable String Literals in C++
Author: Martin Moene
Date: 03 August 2013 18:11:08 +01:00
Summary: How hard can it be to make a file in C++ with international text literals in its name? Alf Steinbach shows us.
Body:
C++ lacks a built-in or library-provided character encoding value type that reflects the main conventions for the encoding of international text literals, API arguments and, for *nix, external text, namely UTF-8 for *nix [1] and UTF-16 for Windows [2]. As a consequence, standard C++ code that works fine in *nix fails outright or produces erroneous results in Windows, as exemplified below. Portable code deals with this by converting strings at run time (an efficiency/complexity cost) and by employing brittle conventions (a cost in programmer’s time). In teaching the problem is largely just ignored, letting students produce programs that, for example, are unable to deal with their Norwegian names (the cost of a negative perception of the language – a language so primitive that it can’t even handle text).
The C++11 standard added the literal prefixes u8, u and U that specify known sizes and encodings, respectively UTF-8, UTF-16 and UTF-32. But no matter whether one chooses [3] u8, u or U, the code needs added runtime conversions on one or the other platform. Exacerbating the situation, the C++ standard library supports only char-based narrow strings in filenames and exception messages, which, for example, means that the current Boost filesystem library [4] can’t access many Windows files – the main desktop platform’s files – when it’s used with the g++ compiler.
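As a small illustration of the three prefixes, the following sketch counts the encoding units that each prefix produces for the single character π (U+03C0). The helper name n_units and the three constants are made up for this example; the character is written as a universal character name so that the result does not depend on the source file encoding.

```cpp
#include <cassert>   // assert, for the usage note below
#include <cstddef>   // std::size_t

// Number of encoding units in a string literal, not counting the
// terminating zero. Hypothetical helper, not part of any library.
template< class Char, std::size_t n >
constexpr std::size_t n_units( Char const (&)[n] ) { return n - 1; }

// The same character, U+03C0 (Greek small letter pi), with each prefix.
constexpr std::size_t utf8_units  = n_units( u8"\u03C0" );  // 8-bit units
constexpr std::size_t utf16_units = n_units( u"\u03C0" );   // 16-bit units
constexpr std::size_t utf32_units = n_units( U"\u03C0" );   // 32-bit units
```

With these definitions utf8_units is 2 while utf16_units and utf32_units are both 1: the literal’s size and encoding are fixed by the prefix, not by the platform, which is why one platform or the other ends up converting at run time.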
Happily the limited issue of suitable original string data for portable code, with UTF-8 for *nix and UTF-16 for Windows, can be dealt with ‘simply’ by using macros that adjust the form of literals. Proper core language support would be better still, but a suitable macro + supporting functionality addresses the problem at compile time, most efficiently, with a single common portable notation. And happily, when the macro always produces a Unicode literal then there is no problem with different character sets (only the encoding differs across systems), and when the macro produces a distinctly typed result [5] then there is no problem with inadvertent mixing of incompatible encodings such as Windows ANSI and UTF-8.
Relevant character encodings and terminology
In the middle 1960s, US government computers employed a large number of incompatible character encodings, which reduced interoperability and added needless costs and hassle. The American National Standards Institute, ANSI [6], therefore created a more general single-byte character encoding which became known as ASCII, the American Standard Code for Information Interchange. And on March 11 1968, President Lyndon B. Johnson approved ASCII as a US federal standard.
The ASCII code was English only. So, while ASCII largely solved the Tower of Babel problem within the English-speaking world, the same problem now resurfaced in the rest of the Western world. From this arose a single-byte ASCII extension intended to serve the needs of Western countries, called ISO Latin 1.
The first Windows versions were based on a Microsoft extension of Latin 1 called Windows ANSI. Today that term has taken on a more general meaning (discussed below), and the original Windows ANSI encoding is now known more precisely as Windows ANSI Western, or codepage 1252. A Windows codepage is a number that designates a character encoding in Windows; reportedly it originally referred to a tabular display of a single-byte encoding, literally a ‘code page’, like Figure 1.
CP 1252 (Windows ANSI Western ext. of Latin 1)

         0 1 2 3 4 5 6 7 8 9 A B C D E F
    00   - - - - - - - - - - - - - - - -
    10   - - - - - - - - - - - - - - - -
    20     ! " # $ % & ' ( ) * + , - . /
    30   0 1 2 3 4 5 6 7 8 9 : ; < = > ?
    40   @ A B C D E F G H I J K L M N O
    50   P Q R S T U V W X Y Z [ \ ] ^ _
    60   ` a b c d e f g h i j k l m n o
    70   p q r s t u v w x y z { | } ~ -
    80   € - ‚ ƒ „ … † ‡ ˆ ‰ Š ‹ Œ - Ž -
    90   - ‘ ’ “ ” • – — ˜ ™ š › œ - ž Ÿ
    A0     ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ - ® ¯
    B0   ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿
    C0   À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï
    D0   Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß
    E0   à á â ã ä å æ ç è é ê ë ì í î ï
    F0   ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ

Figure 1
In Figure 1, table rows 00H through 70H constitute original ASCII. Rows 80H through F0H were added in ISO Latin 1, except that in ISO Latin 1 rows 80H and 90H are undefined characters. The characters shown in rows 80H and 90H in Figure 1, including the Euro sign €, are the Windows ANSI Western extension of Latin 1 (in original Windows ANSI there was, of course, no Euro sign, since there was no Euro).
At some point [7] Windows started supporting local variants of Windows ANSI Western, e.g. with Cyrillic or Greek characters. Whatever narrow encoding is used in the GUI, reported by GetACP(), is known as Windows ANSI, as opposed to the OEM character encoding which is the locally chosen variant of the original IBM PC encoding, used in text consoles. The different variants of Windows ANSI ensure a global Tower of Babel problem, while the use of two incompatible narrow character encodings on the same machine, namely OEM and Windows ANSI, ensures that there's also a local Tower of Babel problem – at least for Windows users.
To address the general Tower of Babel problem a number of leading computer industry firms cooperated on developing a ‘universal’ character encoding, an extension of ISO Latin-1 which became known as Unicode. Original Unicode was a fixed-size encoding with 16 bits per character, and 32-bit Windows NT, released in 1993, was based on this 16-bit encoding. However, 16 bits didn’t suffice for e.g. Chinese ideograms, so Unicode was extended to 21 bits per character, and for the existing software the added characters were to be represented as pairs of 16-bit values, called surrogate pairs. Today this encoding is known as UTF-16, and the original 16-bit per character representation is known as UCS-2 (two bytes per character). Windows’ console subsystem API supports copying of rectangular areas of console windows, but only with 16 bits per character, so console windows are effectively limited to UCS-2, while the rest of Windows is now generally UTF-16.
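The surrogate pair scheme can be sketched as follows. The function name to_utf16 is made up for this example, and the sketch assumes a valid code point (at most 10FFFFH and not itself in the surrogate range), which it does not check.

```cpp
#include <cassert>   // assert, for the usage note below
#include <string>    // std::u16string

// Encodes one Unicode code point as UTF-16.
// Assumption: cp <= 0x10FFFF and cp is not itself a surrogate value.
inline std::u16string to_utf16( char32_t const cp )
{
    std::u16string result;
    if( cp < 0x10000 )
    {
        // Fits in 16 bits: identical to the original UCS-2 representation.
        result += static_cast< char16_t >( cp );
    }
    else
    {
        // 20 significant bits remain after subtracting 0x10000; they are
        // split into two 10-bit halves.
        char32_t const offset = cp - 0x10000;
        result += static_cast< char16_t >( 0xD800 + (offset >> 10) );    // high surrogate
        result += static_cast< char16_t >( 0xDC00 + (offset & 0x3FF) );  // low surrogate
    }
    return result;
}
```

For example, to_utf16( 0x03C0 ) yields the single value 03C0H, while to_utf16( 0x1F600 ) yields the pair D83DH, DE00H. Code that assumes UCS-2 sees the latter as two separate characters.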
32-bit Windows includes many wrapper functions that automatically convert from legacy code’s Windows ANSI to the basic API's UTF-16, and back. Typically there is a UTF-16 based function called FooW, and a Windows ANSI wrapper called FooA. This legacy code support extends to the graphical user interface. However, with respect to window messages (small fixed format data packets used to control windows) Microsoft duplicated its file access API blunder, by using configurable encoding expectations. Pointers in window messages are untyped, and when a given message contains a pointer to a string, then that untyped string is encoded as Windows ANSI or UTF-16 depending on the particular window’s configuration… Thus the terms ANSI window and Unicode window. ‘Windows ANSI’ refers to the narrow character encoding used in the graphical user interface and reported by the GetACP API function, while ‘ANSI window’ [8] refers to a window configured to expect and produce Windows ANSI encoded strings in its window messages.
UTF-8, very popular in *nix and for web pages, is an ASCII extension that encodes all of Unicode by using a variable number of bytes per character.
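A minimal sketch of the variable length scheme, for a single code point. The function name to_utf8 is made up for this example, and the code point is assumed to be valid (no range or surrogate checking).

```cpp
#include <cassert>   // assert, for the usage note below
#include <string>    // std::string

// Encodes one Unicode code point as 1 to 4 UTF-8 bytes.
// Assumption: cp is a valid code point, at most 0x10FFFF.
inline std::string to_utf8( char32_t const cp )
{
    std::string s;
    if( cp < 0x80 )             // 7 bits: plain ASCII, one byte.
    {
        s += static_cast< char >( cp );
    }
    else if( cp < 0x800 )       // Up to 11 bits: two bytes.
    {
        s += static_cast< char >( 0xC0 | (cp >> 6) );
        s += static_cast< char >( 0x80 | (cp & 0x3F) );
    }
    else if( cp < 0x10000 )     // Up to 16 bits: three bytes.
    {
        s += static_cast< char >( 0xE0 | (cp >> 12) );
        s += static_cast< char >( 0x80 | ((cp >> 6) & 0x3F) );
        s += static_cast< char >( 0x80 | (cp & 0x3F) );
    }
    else                        // Up to 21 bits: four bytes.
    {
        s += static_cast< char >( 0xF0 | (cp >> 18) );
        s += static_cast< char >( 0x80 | ((cp >> 12) & 0x3F) );
        s += static_cast< char >( 0x80 | ((cp >> 6) & 0x3F) );
        s += static_cast< char >( 0x80 | (cp & 0x3F) );
    }
    return s;
}
```

With this, to_utf8( 0x41 ) is the single ASCII byte for ‘A’, while π (U+03C0) becomes the two bytes CFH 80H and the Euro sign (U+20AC) becomes the three bytes E2H 82H ACH.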
The inefficiency, complexity and current real world non-portability of standard C++ string literals
Let’s check how some basic, completely standard and therefore presumably automagically [9] portable C++ source code fares in Windows (see Listing 1).
    // Source encoding: UTF 8 with BOM (necessary for Visual C++).
    #include <assert.h>     // assert
    #include <fstream>      // std::ofstream

    auto main() -> int
    {
        auto const filename = "π.recipe";       // A pie recipe. :-)
        std::ofstream f( filename );

        assert( "File creation" && !!f );
    }

Listing 1
Compiling with the MinGW g++ 4.7.2 compiler (a Windows build of the GNU toolchain’s C++ compiler), running the program and checking the result gives the output shown in Figure 2.
    > del a.exe *.recipe 2>nul &^
    More? g++ cplusplus_stdlib_version.cpp &&^
    More? a.exe && dir /b *.recipe
    Ï€.recipe

Figure 2
This produced an erroneous result, a filename different from the specified one, namely Ï€.recipe instead of the specified π.recipe.
In some cases, but mostly with Microsoft’s Visual C++, this happens because a UTF-8-encoded source is misinterpreted as a Windows ANSI-encoded source (so it’s worth checking that the source encoding is correct!). Here, however, the reason is that the MinGW g++ compiler and its standard library implementation have different opinions about what the C++ execution character set is or should be.
The g++ compiler defaults to UTF-8, which is the de facto standard narrow string encoding in *nix, while its standard library implementation, presumably delegating to Microsoft’s runtime library, defaults to Windows ANSI, which is the de facto standard narrow string encoding in Windows programming.
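The effect of that disagreement can be reproduced in a few lines: take the UTF-8 bytes that the compiler stored for the literal and read each byte as a codepage 1252 character, the way the runtime side does. This is only a sketch under stated assumptions: the function names are made up, and of the 80H through 9FH range only the Euro sign is filled in, since that is all this example needs.

```cpp
#include <cassert>   // assert, for the usage note below
#include <string>    // std::string, std::u32string

// Unicode code point that codepage 1252 assigns to a single byte.
// Sketch only: within 0x80..0x9F just the Euro sign is mapped; any other
// byte in that range yields U+FFFD (the replacement character) here.
inline char32_t cp1252_to_unicode( unsigned char const byte )
{
    if( byte < 0x80 || byte >= 0xA0 ) { return byte; }  // Same as ISO Latin 1.
    if( byte == 0x80 ) { return 0x20AC; }               // Euro sign.
    return 0xFFFD;
}

// Reads a byte string as if it were codepage 1252 encoded.
inline std::u32string misread_as_cp1252( std::string const& bytes )
{
    std::u32string result;
    for( char const ch : bytes )
    {
        result += cp1252_to_unicode( static_cast< unsigned char >( ch ) );
    }
    return result;
}
```

The UTF-8 encoding of π is the byte pair CFH 80H, so misread_as_cp1252( "\xCF\x80.recipe" ) yields Ï (U+00CF) followed by € (U+20AC) and then .recipe, i.e. the filename Ï€.recipe.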
Adjusting the g++ compiler’s execution character set to match its standard library’s expectations will in general not help in obtaining a correct result, since most variants of Windows ANSI lack the lowercase Greek π character. But it does convert the silent erroneous result behaviour to a work-saving up-front compilation error. So, when using g++ in Windows, to avoid possible silent erroneous results do add the -fexec-charset=cpYourANSICodepageNumber option, e.g. as shown in Figure 3.
    > del a.exe *.recipe 2>nul &^
    More? g++ cplusplus_stdlib_version.cpp -fexec-charset=cp1252 &&^
    More? a.exe && dir /b *.recipe
    cplusplus_stdlib_version.cpp: In function 'int main()':
    cplusplus_stdlib_version.cpp:7:27: error: converting to execution character set: Illegal byte sequence      ← Nice up-front compilation error.
    cplusplus_stdlib_version.cpp:7:27: error: unable to deduce 'const auto' from '<expression error>'

Figure 3
So, how about using Windows’ own main compiler, Microsoft’s Visual C++, for this code? (See Figure 4.)
    > del b.exe *.recipe 2>nul &^
    More? cl cplusplus_stdlib_version.cpp /Fe"b.exe" &&^
    More? b.exe && dir /b *.recipe
    cplusplus_stdlib_version.cpp
    cplusplus_stdlib_version.cpp(7) : warning C4566: character represented by universal character name '\u03C0' cannot be represented in the current code page (1252)
    Assertion failed: "File creation" && !!f, file cplusplus_stdlib_version.cpp, line 10

Figure 4
Here Visual C++ unfortunately accepted the source code, but happily the program then produced a runtime error. This is far better than g++’s default silent erroneous result, but it’s rather ungood news for the portability of pure standard C++ source code as of 2013. Currently, the two main free C++ compilers for Windows are Visual C++ (Microsoft) and g++ (GNU), and as exemplified above neither of them supports UTF-8 string constants for e.g. filenames.
It’s not that Windows can’t handle the π.recipe filename. Unicode filenames are supported by the Windows API, they’re supported by Windows-specific library extensions such as _wfopen, and there’s no problem creating or accessing such a file in e.g. Java or C# or Python 3. The problem is that such files can’t be accessed using only portable, pure standard C++ source code. Also, even if C++ had the wide string support that is de facto standard in Windows, using it directly for portable code would be needlessly inefficient in *nix. The part of that problem that I address here is the support for string literals.
How Boost filesystem doesn’t help
After the C++ standard library the next place to look for general functionality is usually the Boost library. For our example code the relevant sub-library is the Boost filesystem library. The Boost filesystem library, but apparently sans the boost::filesystem::ofstream class [10] that’s used below, is slated for inclusion in C++ Technical Report 2 (TR2) [11], which effectively means also in the next C++ standard.
However, the Boost filesystem library does not offer or visibly use [12] portable system dependent strings, and so for portable code, with the Boost filesystem library a filename such as "π.recipe" has to be specified as a wide string, like L"π.recipe".
Since it’s impractical to deal with two or more different string formats, one would then presumably standardize on using wchar_t based strings for all portable strings. This then incurs a string conversion cost in *nix, in the worst case for almost every API call involving strings, which is counter to the general C++ principle of not paying for what you don’t use. This cost (and others) is meant to buy a correct result, so let’s check whether the Boost filesystem library actually does produce a correct result (Listing 2).
    // Source encoding: UTF 8 with BOM (necessary for Visual C++).
    #include <assert.h>                     // assert
    #include <boost/filesystem/fstream.hpp> // boost::filesystem::ofstream

    namespace bfs = boost::filesystem;

    auto main() -> int
    {
        auto const filename = L"π.recipe";  // A pie recipe. :-)
        bfs::ofstream f( filename );

        assert( "File creation" && !!f );
    }

Listing 2
Compiling the program with the Visual C++ 11.0 compiler, using the Boost 1.54 filesystem and system libraries, and running gives the result shown in Figure 5.
    > del b.exe *.recipe 2>nul &^
    More? cl cplusplus_boost_version.cpp /MD /Fe"b.exe" /I"%boost_pincludes%" /link %msvc_link_bfs% &&^
    More? b.exe && dir /b *.recipe
    cplusplus_boost_version.cpp
    π.recipe

Figure 5
Well, that worked nicely! At an encoding conversion cost for *nix, and at the general cost of using Boost. But how about building with MinGW g++ 4.7.2? (See Figure 6.)
    > del a.exe *.recipe 2>nul &^
    More? g++ cplusplus_boost_version.cpp -fexec-charset=cp1252 %gnuc_using_bfs% &&^
    More? a.exe && dir /b *.recipe
    Assertion failed: "File creation" && !!f, file cplusplus_boost_version.cpp, line 12

    This application has requested the Runtime to terminate it in an unusual way.
    Please contact the application's support team for more information.

Figure 6
The Boost filesystem library takes advantage of a Visual C++ extension to the standard library, namely a wchar_t based ofstream constructor, when the library is built with Visual C++. The g++ compiler’s standard library implementation has a cleaner extension, a std::streambuf subclass that can be initialized from a C FILE*. And a possibly more efficient workaround for the standard library’s lack of Unicode filename support, which works with any compiler, is Windows’ so-called ‘short’ or ‘DOS’ or ‘8+3’ filenames, which were used in Boost filesystem version 2 [13]. But the current Boost filesystem library simply doesn’t support Windows C++ compilers in general. For Windows it now only provides full functionality, the ability to portably access files with names such as π.recipe, when it’s used with Visual C++ or a compiler with the same standard library extensions…
If the Boost filesystem library is just made part of the C++ standard we’ll then have an absurdity: a part of the standard library making essential use of wide string based constructors, and thus effectively requiring them [14] of all Windows standard library implementations, without having them standardized and available to all.
Summing up, since using Boost filesystem as a portability layer requires using wide strings it incurs an efficiency/complexity cost on *nix, a cost that in a great many cases buys you nothing. And worse, the current version doesn’t even produce correct results with g++ in Windows, thus not providing the goods that the cost was meant to cover. Thus, as of this writing (July 2013) Boost filesystem is not a solution.
Strongly typed system dependent literals
In the same way that C++ integer types such as int are portable because their sizes depend on the system, one can define a character encoding value type [15] that’s portable because its size and assumed encoding depend usefully on the system. I.e., a system dependent character encoding value type, which is portable precisely because it’s system dependent – just as with the int type etc. It can look like Listing 3.
    #ifdef  _WIN32
        namespace cppx{ typedef wchar_t Raw_syschar; }  // Implies UTF-16 encoding.
    #   define CPPX_WITH_SYSCHAR_PREFIX( lit ) L##lit
    #else
        namespace cppx{ typedef char Raw_syschar; }     // Implies UTF-8 encoding.
    #   define CPPX_WITH_SYSCHAR_PREFIX( lit ) lit
    #endif

Listing 3
The _WIN32 macro, a de facto standard in Windows C and C++ programming, is defined for both 32-bit and 64-bit Windows programming. There is one problem with the Raw_syschar type, though, namely that it’s just a synonym for another type: it isn’t distinct. For example, one cannot define a distinct std::basic_string specialization for it. It’s practically possible [16] to define a distinct Raw_syschar type as a class, but in order to be able to put that inside a constructor-free union – as can happen with the short string optimization [17], where the union then occurs in the std::basic_string implementation – it would need to be without any user defined constructor. That means that it would need to expose a public data member, which is somewhat unclean, and different from use of basic types like char and wchar_t.
Happily with C++11, and with Visual C++ for some time before that as a language extension, one can define an enum type with a specified underlying representation (this and all the following definitional code is in namespace cppx):
enum Syschar : Raw_syschar {};
This produces a type with very much the desired properties [18] of a character encoding value type, namely, it’s a distinct type that supports all the built-in comparison operators, and it provides an implicit conversion to integer.
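These properties can be checked directly. The sketch below repeats the two definitions so that it stands alone; the helper functions is_typed and code_of are made up for this example only.

```cpp
#include <cassert>        // assert, for the usage note below
#include <type_traits>    // std::is_same, std::underlying_type

namespace cppx {
#ifdef _WIN32
    typedef wchar_t Raw_syschar;
#else
    typedef char Raw_syschar;
#endif

    enum Syschar : Raw_syschar {};

    // Distinct from the raw type, yet with the same representation.
    static_assert( !std::is_same< Syschar, Raw_syschar >::value,
        "Syschar is a distinct type" );
    static_assert(
        std::is_same< std::underlying_type< Syschar >::type, Raw_syschar >::value,
        "Syschar is represented as a Raw_syschar" );
    static_assert( sizeof( Syschar ) == sizeof( Raw_syschar ),
        "Same size as the raw type" );

    // Because the type is distinct, overloads can tell the two apart.
    // (Hypothetical helpers for this example only.)
    inline bool is_typed( Syschar )     { return true; }
    inline bool is_typed( Raw_syschar ) { return false; }

    // The conversion to integer is implicit.
    inline int code_of( Syschar const c ) { return c; }
}  // namespace cppx
```

Given a = static_cast< cppx::Syschar >( 'a' ) and a similarly made b, the expressions cppx::is_typed( a ), a < b and cppx::code_of( a ) == 'a' are all true. The conversion in the other direction, from integer or raw character to Syschar, requires an explicit cast, which is what prevents inadvertent mixing.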
And just by defining a std::char_traits specialization this type supports a distinct specialization of std::basic_string, if you should want that. Such a std::char_traits specialization is just a collection of static member functions that forward to the corresponding functions for the raw character type. However, such forwarding functions require general conversion between raw and typed characters and character strings - e.g. the following three typed functions for converting to strongly typed form, and corresponding raw functions the other way (Listing 4) [19].
    auto typed( Raw_syschar const c ) CPPX_NOEXCEPT
        -> Syschar
    { return static_cast< Syschar const >( c ); }

    auto typed( Raw_syschar* const s ) CPPX_NOEXCEPT
        -> Syschar*
    { return reinterpret_cast< Syschar* >( s ); }

    auto typed( Raw_syschar const* const s ) CPPX_NOEXCEPT
        -> Syschar const*
    { return reinterpret_cast< Syschar const* >( s ); }

Listing 4
This looks trivial, yes?
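For reference, a forwarding std::char_traits specialization of the kind described above could look roughly like the following. This is a sketch, not the article’s actual code: it covers only the *nix case (Raw_syschar as char) to stay short, it does the raw/typed conversions with private helpers instead of the typed and raw functions, and the name Syschar_string is made up.

```cpp
#include <cassert>   // assert, for the usage note below
#include <string>    // std::char_traits, std::basic_string

namespace cppx {
    typedef char Raw_syschar;       // Assumption: the *nix case only.
    enum Syschar : Raw_syschar {};
}

namespace std {
    template<>
    struct char_traits< cppx::Syschar >
    {
        typedef cppx::Syschar           char_type;
        typedef cppx::Raw_syschar       Raw;
        typedef char_traits< Raw >      Raw_traits;
        typedef Raw_traits::int_type    int_type;
        typedef Raw_traits::off_type    off_type;
        typedef Raw_traits::pos_type    pos_type;
        typedef Raw_traits::state_type  state_type;

        // Local conversion helpers between raw and typed pointers.
        static Raw* raw( char_type* s ) { return reinterpret_cast< Raw* >( s ); }
        static Raw const* raw( char_type const* s ) { return reinterpret_cast< Raw const* >( s ); }
        static char_type* typed( Raw* s ) { return reinterpret_cast< char_type* >( s ); }
        static char_type const* typed( Raw const* s ) { return reinterpret_cast< char_type const* >( s ); }

        // Everything below just forwards to the raw character traits.
        static void assign( char_type& c, char_type const& a ) { c = a; }
        static bool eq( char_type a, char_type b ) { return Raw_traits::eq( Raw( a ), Raw( b ) ); }
        static bool lt( char_type a, char_type b ) { return Raw_traits::lt( Raw( a ), Raw( b ) ); }
        static int compare( char_type const* a, char_type const* b, size_t n )
        { return Raw_traits::compare( raw( a ), raw( b ), n ); }
        static size_t length( char_type const* s ) { return Raw_traits::length( raw( s ) ); }
        static char_type const* find( char_type const* s, size_t n, char_type const& c )
        { return typed( Raw_traits::find( raw( s ), n, Raw( c ) ) ); }
        static char_type* move( char_type* d, char_type const* s, size_t n )
        { return typed( Raw_traits::move( raw( d ), raw( s ), n ) ); }
        static char_type* copy( char_type* d, char_type const* s, size_t n )
        { return typed( Raw_traits::copy( raw( d ), raw( s ), n ) ); }
        static char_type* assign( char_type* s, size_t n, char_type c )
        { return typed( Raw_traits::assign( raw( s ), n, Raw( c ) ) ); }
        static int_type not_eof( int_type i ) { return Raw_traits::not_eof( i ); }
        static char_type to_char_type( int_type i )
        { return static_cast< char_type >( Raw_traits::to_char_type( i ) ); }
        static int_type to_int_type( char_type c ) { return Raw_traits::to_int_type( Raw( c ) ); }
        static bool eq_int_type( int_type a, int_type b ) { return Raw_traits::eq_int_type( a, b ); }
        static int_type eof() { return Raw_traits::eof(); }
    };
}  // namespace std

namespace cppx {
    typedef std::basic_string< Syschar > Syschar_string;    // Hypothetical name.
}
```

With this in place a cppx::Syschar_string can be constructed, searched and compared like any other string, while remaining a type distinct from std::string.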
Unfortunately, in order to later be able to construct a class type string very efficiently from a literal, it’s very desirable to also have a function template like Listing 5, but this function template can then never be implicitly selected. The reason is that for an array type actual argument of any given size the corresponding specialization would not offer a better argument conversion than the pointer argument function. Considering the argument conversions alone, the call would therefore be ambiguous. The C++11 standard then decrees, in the fifth dash of its §13.3.3/1, that F1 is a better function than F2 if “F1 is a non-template function and F2 is a function template specialization”, which for the above functions means that the pointer argument function will always win.
    template< Size n >
    auto typed( Raw_syschar const (&a)[n] ) CPPX_NOEXCEPT
        -> Syschar const (&)[n]
    { return reinterpret_cast< Syschar const (&)[n] >( a ); }

Listing 5
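The overload resolution behaviour described above can be observed in isolation with two functions that only report which of them was called. The names Selected and which are made up for this example, and plain char is used instead of Raw_syschar.

```cpp
#include <cassert>   // assert, for the usage note below
#include <cstddef>   // std::size_t

enum Selected { pointer_overload, array_overload };

// Ordinary function taking a pointer.
inline Selected which( char const* ) { return pointer_overload; }

// Function template taking a reference to an array of any size.
template< std::size_t n >
Selected which( char const (&)[n] ) { return array_overload; }
```

Even though a string literal is an array, which( "abc" ) returns pointer_overload, and so does a call with a named array such as char const buffer[] = "xyz". The array-to-pointer conversion does not count against the first function, and then the non-template function is preferred.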
My chosen fix is to route all calls to functions in a given set (e.g. typed() calls) via a single function template. The template just checks the actual argument type and dispatches the real work. To enable the dispatch call’s function selection each of the typed overloads, and also each of the raw overloads, is outfitted with a defaulted nameless dummy argument that identifies the general kind of actual argument (see Listing 6).
    namespace detail {
        ...
        inline auto typed(
            Raw_syschar const* const& s, Pointer_kind = Pointer_kind()
            ) CPPX_NOEXCEPT
            -> Syschar const* const&
        { return reinterpret_cast< Syschar const* const& >( s ); }

        template< Size n >
        auto typed(
            Raw_syschar const (&a)[n], Array_kind = Array_kind()
            ) CPPX_NOEXCEPT
            -> Syschar const (&)[n]
        { return reinterpret_cast< Syschar const (&)[n] >( a ); }
    }  // namespace detail

Listing 6
Listing 7 shows the function template for this set of functions, through which all typed calls go. Type_kind_ is part of the small machinery that checks the argument type (see Listing 8).
    template< class Arg >
    auto typed( Arg const& arg ) CPPX_NOEXCEPT
        -> decltype( detail::typed( arg, typename Type_kind_<Arg>::T() ) )
    { return detail::typed( arg, typename Type_kind_<Arg>::T() ); }

Listing 7
    #pragma once
    // Copyright (c) 2013 Alf P. Steinbach
    // Mostly this is to enable a workaround for ordinary overload resolution.

    #include <rfc/cppx/core/Size.h>     // cppx::Size

    namespace cppx {
        enum Value_kind {};
        enum Pointer_kind {};
        enum Array_kind {};

        template< class Type >
        struct Type_kind_ { typedef Value_kind T; };

        template< class Type >
        struct Type_kind_<Type*> { typedef Pointer_kind T; };

        template< class Type >
        struct Type_kind_<Type* const> { typedef Pointer_kind T; };

        template< class Type, Size n >
        struct Type_kind_< Type[n] > { typedef Array_kind T; };
    }  // namespace cppx

Listing 8
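To see the dispatch at work without the rest of the library, here is a reduced, self-contained variant of the same machinery. It is an illustration under simplifying assumptions: the functions only report which overload was reached, plain char and std::size_t stand in for Raw_syschar and Size, the kind arguments are not defaulted, and everything lives in a made-up namespace sketch.

```cpp
#include <cassert>   // assert, for the usage note below
#include <cstddef>   // std::size_t

namespace sketch {
    enum Selected { pointer_overload, array_overload };

    enum Value_kind {};
    enum Pointer_kind {};
    enum Array_kind {};

    // Maps an argument type to its general kind.
    template< class Type >
    struct Type_kind_ { typedef Value_kind T; };

    template< class Type >
    struct Type_kind_< Type* > { typedef Pointer_kind T; };

    template< class Type, std::size_t n >
    struct Type_kind_< Type[n] > { typedef Array_kind T; };

    namespace detail {
        // The second, nameless argument makes each overload viable only
        // for its own kind of actual argument.
        inline Selected which( char const* const&, Pointer_kind )
        { return pointer_overload; }

        template< std::size_t n >
        Selected which( char const (&)[n], Array_kind )
        { return array_overload; }
    }  // namespace detail

    // The single entry point: Arg is deduced without array-to-pointer
    // decay, so Type_kind_<Arg> can tell an array from a pointer.
    template< class Arg >
    Selected which( Arg const& arg )
    {
        return detail::which( arg, typename Type_kind_< Arg >::T() );
    }
}  // namespace sketch
```

Now sketch::which( "abc" ) returns array_overload, while a call with a char const* variable returns pointer_overload, so the array version is finally reachable.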
Listing 9 is the file creation program again, but now using Syschar directly (only the machinery shown so far), producing a correct result. The just-for-this-example ad hoc header x/ofstream.h defines a subclass of std::ofstream called x::ofstream that provides a Syschar-based constructor by employing compiler-specific functionality. The necessity of compiler-specific or at least system-specific code for such basic functionality indicates to me that this area of functionality belongs in the standard.
    // Source encoding: UTF 8 with BOM (necessary for Visual C++).
    #include "x/ofstream.h"     // x::ofstream
    #include <assert.h>         // assert

    auto main() -> int
    {
        using cppx::typed;

        // A pie recipe. :-)
        auto const filename = typed( CPPX_WITH_SYSCHAR_PREFIX( "π.recipe" ) );
        x::ofstream f( filename );

        assert( "File creation" && !!f );
    }

Listing 9
But as the declaration of filename in Listing 9 shows, direct use of the conversion functionality defined so far yields rather verbose specifications of literal strings…

To support more concise usage expressions I therefore define two further macros, CPPX_U to express a typed literal and CPPX_RAW_U to express an untyped one (Listing 10).
    #define CPPX_AS_SYSCHAR( lit ) \
        ::cppx::typed( CPPX_WITH_SYSCHAR_PREFIX( lit ) )

    #define CPPX_U      CPPX_AS_SYSCHAR
    #define CPPX_RAW_U  CPPX_WITH_SYSCHAR_PREFIX

Listing 10
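The following self-contained sketch shows what the macro notation produces on each system. It is a reduced stand-in, not the library code: typed() is given only its array version, so no dispatch machinery is needed, and length_of is a helper made up for this example. The character π is written as \u03C0 so that the sketch does not depend on the source file encoding.

```cpp
#include <cassert>   // assert, for the usage note below
#include <cstddef>   // std::size_t

#ifdef _WIN32
    namespace cppx { typedef wchar_t Raw_syschar; }     // Implies UTF-16.
#   define CPPX_WITH_SYSCHAR_PREFIX( lit ) L##lit
#else
    namespace cppx { typedef char Raw_syschar; }        // Implies UTF-8.
#   define CPPX_WITH_SYSCHAR_PREFIX( lit ) lit
#endif

namespace cppx {
    enum Syschar : Raw_syschar {};

    // Simplified stand-in for typed(): array arguments only.
    template< std::size_t n >
    auto typed( Raw_syschar const (&a)[n] ) noexcept
        -> Syschar const (&)[n]
    { return reinterpret_cast< Syschar const (&)[n] >( a ); }

    // Number of encoding units, not counting the terminating zero.
    // (Hypothetical helper for this example only.)
    template< std::size_t n >
    constexpr std::size_t length_of( Syschar const (&)[n] ) noexcept
    { return n - 1; }
}  // namespace cppx

#define CPPX_AS_SYSCHAR( lit ) \
    ::cppx::typed( CPPX_WITH_SYSCHAR_PREFIX( lit ) )

#define CPPX_U      CPPX_AS_SYSCHAR
#define CPPX_RAW_U  CPPX_WITH_SYSCHAR_PREFIX
```

After auto const& filename = CPPX_U( "\u03C0.recipe" ); the call cppx::length_of( filename ) gives 9 where Raw_syschar is char and the execution character set is UTF-8 (π takes two bytes), and 8 where it is wchar_t. The source text is the same in both cases; only the compile-time form of the literal differs.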
And with CPPX_U the file creation program looks, to my eyes, acceptable (Listing 11).
    // Source encoding: UTF 8 with BOM (necessary for Visual C++).
    #include "x/ofstream.h"     // x::ofstream
    #include <assert.h>         // assert

    auto main() -> int
    {
        auto const filename = CPPX_U( "π.recipe" );     // A pie recipe. :-)
        x::ofstream f( filename );

        assert( "File creation" && !!f );
    }

Listing 11
When it’s compiled for Windows this program uses UTF-16 encoded wchar_t based strings, and when it’s compiled for *nix it uses UTF-8 encoded char based strings. Unlike the C++ standard library and unlike Boost filesystem this ensures maximum efficiency for API calls, i.e. no runtime encoding conversion. And also unlike the C++ standard library and unlike Boost filesystem, with the necessary higher level functional support such as exemplified by x::ofstream, it provides access to all valid filenames on each system, and it lets students almost effortlessly write portable basic C++ programs that can handle Norwegian student names, etc.
Summary and final considerations
Standard C++11 does not provide the means to access Windows files in general, because the filenames can’t be expressed as Windows ANSI encoded char based strings. The Boost filesystem library, slated for inclusion in TR2, imposes an efficiency cost for portable code used in *nix by requiring portable strings to be wchar_t based. And in Windows the Boost filesystem library only supports general Unicode filenames when it’s used with the Visual C++ compiler.
The main idea for the library solution presented here is to use only the portable CPPX_U string notation in the portable code, and to have such strings reinterpreted as system specific char or wchar_t based strings for the system dependent implementation code, if any, and as necessary. By using a character encoding value type that’s defined differently depending on the system, plus a macro that adds strong typing and an L literal prefix as required for each system, the exact same source code can specify strongly typed string literals with UTF-8 encoding for *nix, and with UTF-16 encoding for Windows. This is maximally efficient for each system’s API function calls and favoured external text encoding, and makes it technically possible to access all valid filenames on each system, as shown.
To make this work most seamlessly the C++ source code should then be UTF-8 encoded with BOM, because that encoding is accepted and understood by default by both Visual C++ [20] and g++, and because support for this source encoding is a reasonable requirement for any C++ compiler that one might consider using.
[1] I haven’t found any authoritative statements or data about *nix character encodings other than Markus Kuhn’s Unix Unicode FAQ maintaining that “UTF-8 is the way in which Unicode is used under Unix, Linux, and similar systems”. In Nov. 2011 I asked about it on Stack Exchange, but alas without a definitive answer. If you’re interested in various opinions and details then check out that question at: http://unix.stackexchange.com/questions/24529/most-common-encoding-for-strings-in-c-in-linux-and-unix.
[2] The main Windows C++ compiler, Visual C++, supports only Windows ANSI as a narrow C++ execution character set, and UTF-16 for wide string literals. Windows ANSI cannot portably encode international text and incurs conversion costs. UTF-16, in Windows called ‘Unicode’, is therefore used by the vast majority of projects, and is the default in Visual Studio projects.
[3] At the time of writing, Visual C++ in version 11.0 does not yet support the C++11 u8, u and U prefixes.
[4] As of Boost version 1.54, released during the writing of this article.
[5] For standard C++ the u8 prefix does produce a char based literal.
[6] At the time known as the United States of America Standards Institute, USASI; the name was changed to the American National Standards Institute, ANSI, in 1969.
[7] According to Wikipedia’s codepage article, at http://en.wikipedia.org/wiki/Code_page, DOS gained codepage support in version 3.3, in 1987, while the first version of Windows was released in 1985.
[8] The term ‘ANSI Windows’ was used by one reviewer, who conflated it with ‘Windows ANSI’ (encoding) and ‘ANSI window’ (configuration). This term can appear to be used when ‘ANSI’ is used as a qualification. E.g. ‘ANSI Windows codepages’, meaning ‘ANSI (Windows codepages)’, the codepages that can be used as Windows ANSI, i.e., that can be returned by GetACP.
[9] ...to compilers that support UTF-8 source code, which all the relevant compilers do. More generally, portable for C++ means portable within the limits of the language implementation that one ports to. E.g., putting this to the point, the C++ standard does not specify the size of bool so that frivolous use of bool type local variables conceivably could exceed the available memory, yet such code is portable. One reviewer has however argued that C++ only supports source code with the characters formally guaranteed to be supported, i.e. only pure ASCII source code with no “$” signs, portably.
[10] Judging by the N3693 draft Technical Specification at http://isocpp.org/files/papers/N3693.html
[11] Wikipedia lists the TR2 proposals at http://en.wikipedia.org/wiki/C++_Technical_Report_1#Technical_Report_2
[12] Internally the boost::filesystem::path class uses a representation of international text where the public definition value_type corresponds to the ‘raw’ encoding value type discussed in this article, with UTF-8 for *nix and UTF-16 for Windows. Presumably with C++14 (if that should be the next C++ standard), this article’s Raw_syschar could be defined as std::filesystem::path::value_type.
[13] I filed a ticket about its disappearance in 2011, #6065 available at https://svn.boost.org/trac/boost/ticket/6065
[14] The N3693 draft Technical Specification contains this wording in its §8.4.6: “Implementations of the standard library for systems where string_type is wstring, such as Windows, are encouraged to provide an extension to existing standard library file stream constructors and open functions that adds overloads that accept wstrings for file names. Microsoft and Dinkumware already provide such an extension.”
[15] The wchar_t type can be argued to be such a type, but it’s impractical for the purpose of portability.
[16] While it’s not guaranteed by the C++ standard, as far as I know there’s no compiler that by default will yield sizeof(T) > 1 when T is a POD class with just a single char data member.
[17] The last time I checked, two or three years ago, it did happen with Visual C++’s std::string.
[18] If names of e.g. control characters are desired then one can use an enum class in order to support easy name qualification, but for this article’s exposition enum class would not have a purpose.
[19] Here CPPX_NOEXCEPT is a macro that depending on the compiler is defined as C++11 noexcept (e.g. for g++ and clang) or C++03 throw() (for Visual C++ 11.0 and earlier).
[20] As a practical matter, for UTF-8 encoded source code the Visual C++ compiler requires a Byte Order Mark (BOM) in order to correctly deduce the encoding. Some earlier versions of the g++ compiler didn’t support a BOM for UTF-8, but now it does so that it’s not even necessary to do that minimal source code encoding conversion. The same source can be used exactly as-is for both systems.