    <rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:admin="http://webns.net/mvcb/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:content="http://purl.org/rss/1.0/modules/content/">
     <channel>
        <title>ACCU  :: Portable String Literals in C++</title>
        <link>https://members.accu.org/index.php/journals/1842</link>
        <description>Professionalism in Programming</description>
        <dc:language>en-us</dc:language> 
        <dc:creator>Administrator</dc:creator> 
        <admin:generatorAgent rdf:resource="http://www.xaraya.org" /> 
        <admin:errorReportsTo rdf:resource="mailto:webeditor@accu.org" />
       <sy:updatePeriod>hourly</sy:updatePeriod>
       <sy:updateFrequency>1</sy:updateFrequency>
       <docs>http://backend.userland.com/rss</docs>


        <h2>Journal Articles</h2>


<div class="xar-mod-head"><span class="xar-mod-title">Overload Journal #116 - August 2013 + Programming Topics</span></div>





<div class="xar-norm xar-standard-box-padding">
   <h1><strong>Title:</strong>&nbsp;Portable String Literals in C++</h1>
<p><strong>Author:</strong>&nbsp;Martin Moene</p>
<p>
<strong>Date:</strong> Sat, 03 August 2013 18:11:08 +01:00</p>
<p><strong>Summary:</strong>&nbsp;How hard can it be to make a file in C++ with international text literals in its name? Alf Steinbach shows us.</p>
<p><strong>Body:</strong>&nbsp;<p>C++ lacks a built-in or library-provided character encoding value type that reflects the main conventions for the encoding of international text literals, API arguments and, for *nix, external text, namely UTF-8 for *nix<a href="#FN01"><sup>1</sup></a> and UTF-16 for Windows<a href="#FN02"><sup>2</sup></a>. As a consequence, standard C++ code that works fine in *nix fails outright or produces erroneous results in Windows, as exemplified below. Portable code deals with this by converting strings at run time (efficiency/complexity cost), and by employing brittle conventions (programmer’s time cost), and in teaching the problem is largely just ignored, letting students produce programs that, for example, are unable to deal with their Norwegian names (cost of negative perception of the language – a language so primitive that it can’t even handle text).</p>

<p>The C++11 standard added the literal prefixes <code>u8</code>, <code>u</code> and <code>U</code> that specify known sizes and encodings, respectively UTF-8, UTF-16 and UTF-32. But no matter whether one chooses<a href="#FN03"><sup>3</sup></a> <code>u8</code>, <code>u</code> or <code>U</code>, the code needs added runtime conversions on one or the other platform. Exacerbating the situation, the C++ standard library supports only char-based narrow strings in filenames and exception messages, which, for example, means that the current Boost filesystem library<a href="#FN04"><sup>4</sup></a> can’t access many Windows files – the main desktop platform’s files – when it’s used with the g++ compiler.</p>

<p>Happily the limited issue of suitable <em>original string data</em> for portable code, with UTF-8 for *nix and UTF-16 for Windows, can be dealt with ‘simply’ by using macros that adjust the form of literals. Proper core language support would be better still, but a suitable macro + supporting functionality addresses the problem at compile time, most efficiently, with a single common portable notation. And happily, when the macro always produces a Unicode literal then there is no problem with different character sets (only the encoding differs across systems), and when the macro produces a distinctly typed result<a href="#FN05"><sup>5</sup></a> then there is no problem with inadvertent mixing of incompatible encodings such as Windows ANSI and UTF-8.</p>

<h2>Relevant character encodings and terminology</h2>

<p>In the early 1960s, US government computers employed a large number of incompatible character encodings, which reduced interoperability and added needless costs and hassle. The <em>American National Standards Institute</em>, <strong>ANSI</strong><a href="#FN06"><sup>6</sup></a>, therefore created a more general single-byte character encoding which became known as <strong>ASCII</strong>, the <em>American Standard Code for Information Interchange</em>. And on March 11 1968, President Lyndon B. Johnson approved ASCII as a US federal standard.</p>

<p>The ASCII code was English only. So, while ASCII largely solved the Tower of Babel problem within the English-speaking world, the same problem now resurfaced in the rest of the Western world. From this arose a single-byte ASCII extension intended to serve the needs of Western countries, called <strong>ISO Latin 1</strong>.</p>

<p>The first Windows versions were based on a Microsoft extension of Latin 1 called Windows ANSI. Today that term has taken on a more general meaning (discussed below), and the original Windows ANSI encoding is now known more precisely as <strong>Windows ANSI Western</strong>, or codepage 1252. A Windows <strong>codepage</strong> is a number that designates a character encoding in Windows; reportedly it originally referred to a tabular display of a single-byte encoding, literally a ‘code page’, like Figure 1.</p>

<table class="sidebartable">
	<tr>
		<td>CP 1252 (Windows ANSI Western ext. of Latin 1)</td>
	</tr>
	<tr>
		<td>
			<pre class="programlisting">
     0 1 2 3 4 5 6 7 8 9 A B C D E F

00   - - - - - - - - - - - - - - - -
10   - - - - - - - - - - - - - - - -
20     ! &quot; # $ % &amp; ' ( ) * + , - . /
30   0 1 2 3 4 5 6 7 8 9 : ; &lt; = &gt; ?
40   @ A B C D E F G H I J K L M N O
50   P Q R S T U V W X Y Z [ \ ] ^ _
60   ` a b c d e f g h i j k l m n o
70   p q r s t u v w x y z { | } ~ 
80   €   ‚ ƒ „ … † ‡ ˆ ‰ Š ‹ Œ   Ž  
90     ‘ ’ “ ” • – &#8212; ˜ ™ š › œ   ž Ÿ
A0     ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬   ® ¯
B0   ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿
C0   À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï
D0   Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß
E0   à á â ã ä å æ ç è é ê ë ì í î ï
F0   ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ
			</pre>
		</td>
	</tr>
	<tr>
		<td class="title">Figure 1</td>
	</tr>
</table>

<p>In Figure 1, table rows <code>00H</code> through <code>70H</code> constitute original ASCII. Rows <code>80H</code> through <code>F0H</code> were added in ISO Latin 1, except that in ISO Latin 1 rows <code>80H</code> and <code>90H</code> are undefined characters. The characters shown in rows <code>80H</code> and <code>90H</code> in Figure 1, including the Euro sign €, are the Windows ANSI Western extension of Latin 1 (in original Windows ANSI there was, of course, no Euro sign, since there was no Euro).</p>

<p>At some point<a href="#FN07"><sup>7</sup></a> Windows started supporting local variants of Windows ANSI Western, e.g. with Cyrillic or Greek characters. Whatever narrow encoding is used in the GUI, as reported by <code>GetACP()</code>, is known as <strong>Windows ANSI</strong>, as opposed to the <strong>OEM</strong> character encoding, which is the locally chosen variant of the original IBM PC encoding, used in text consoles. The different variants of Windows ANSI ensure a global Tower of Babel problem, while the use of two incompatible narrow character encodings on the same machine, namely OEM and Windows ANSI, ensures that there's also a local Tower of Babel problem – at least for Windows users.</p>

<p>To address the general Tower of Babel problem a number of leading computer industry firms cooperated on developing a ‘universal’ character encoding, an extension of ISO Latin-1 which became known as <strong>Unicode</strong>. Original Unicode was a fixed size 16-bit per character encoding, and 32-bit Windows NT, introduced in 1993, was based on this 16-bit encoding. However, 16 bits didn’t suffice for e.g. Chinese ideograms, so Unicode was extended to 21 bits per character, and for the existing software the added characters were to be represented as <em>pairs</em> of 16-bit values, called <strong>surrogate pairs</strong>. Today this encoding is known as <strong>UTF-16</strong>, and the original 16-bit per character representation is known as <strong>UCS-2</strong> (two bytes per character). Windows’ console subsystem API supports copying of rectangular areas of console windows, but only with 16 bits per character, so console windows are effectively limited to UCS-2, while the rest of Windows is now generally UTF-16.</p>

<p>32-bit Windows includes many wrapper functions that automatically convert from legacy code’s Windows ANSI to the basic API's UTF-16, and back. Typically there is a UTF-16 based function called <code>FooW</code>, and a Windows ANSI wrapper called <code>FooA</code>. This legacy code support extends to the graphical user interface. However, with respect to <em>window messages</em> (small fixed format data packets used to control windows) Microsoft duplicated its file access API blunder, by using configurable encoding expectations. Pointers in window messages are untyped, and when a given message contains a pointer to a string, then that untyped string is encoded as Windows ANSI or UTF-16 depending on the particular window’s configuration… Thus the terms <strong>ANSI window</strong> and <strong>Unicode window</strong>. ‘Windows ANSI’ refers to the narrow character encoding used in the graphical user interface and reported by the <code>GetACP</code> API function, while ‘ANSI window’<a href="#FN08"><sup>8</sup></a> refers to a window configured to expect and produce Windows ANSI encoded strings in its window messages.</p>

<p><strong>UTF-8</strong>, very popular in *nix and for web pages, is an ASCII extension that encodes all of Unicode by using a variable number of bytes per character.</p>
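<p>To make the two variable-length encodings concrete, here is a minimal sketch that encodes a single Unicode code point as UTF-8 bytes and as UTF-16 code units. The helper names <code>to_utf8</code> and <code>to_utf16</code> are hypothetical and not part of the article's <code>cppx</code> code; they only illustrate the bit layouts described above.</p>

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper (not from the article): one code point -> UTF-8 bytes.
inline std::string to_utf8( std::uint32_t cp )
{
    // Only valid scalar values: at most 21 bits, and not a surrogate.
    assert( cp <= 0x10FFFF && !( cp >= 0xD800 && cp <= 0xDFFF ) );
    std::string out;
    if( cp < 0x80 ) {                       // 1 byte: plain ASCII
        out += static_cast<char>( cp );
    } else if( cp < 0x800 ) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>( 0xC0 | ( cp >> 6 ) );
        out += static_cast<char>( 0x80 | ( cp & 0x3F ) );
    } else if( cp < 0x10000 ) {             // 3 bytes
        out += static_cast<char>( 0xE0 | ( cp >> 12 ) );
        out += static_cast<char>( 0x80 | ( ( cp >> 6 ) & 0x3F ) );
        out += static_cast<char>( 0x80 | ( cp & 0x3F ) );
    } else {                                // 4 bytes
        out += static_cast<char>( 0xF0 | ( cp >> 18 ) );
        out += static_cast<char>( 0x80 | ( ( cp >> 12 ) & 0x3F ) );
        out += static_cast<char>( 0x80 | ( ( cp >> 6 ) & 0x3F ) );
        out += static_cast<char>( 0x80 | ( cp & 0x3F ) );
    }
    return out;
}

// Hypothetical helper (not from the article): one code point -> UTF-16 units.
inline std::u16string to_utf16( std::uint32_t cp )
{
    assert( cp <= 0x10FFFF && !( cp >= 0xD800 && cp <= 0xDFFF ) );
    std::u16string out;
    if( cp < 0x10000 ) {                    // fits in one 16-bit unit (the UCS-2 range)
        out += static_cast<char16_t>( cp );
    } else {                                // surrogate pair: 10 bits in each half
        std::uint32_t const v = cp - 0x10000;
        out += static_cast<char16_t>( 0xD800 | ( v >> 10 ) );
        out += static_cast<char16_t>( 0xDC00 | ( v & 0x3FF ) );
    }
    return out;
}
```

<p>For example, the lowercase Greek π (U+03C0) becomes the two UTF-8 bytes <code>CF 80</code> but a single UTF-16 unit, while a code point above U+FFFF needs four UTF-8 bytes and a surrogate pair.</p>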

<h2>The inefficiency, complexity and current real world non-portability of standard C++ string literals</h2>

<p>Let’s check how some basic, completely standard and therefore presumably automagically<a href="#FN09"><sup>9</sup></a> portable C++ source code fares in Windows (see Listing 1).</p>

<table class="sidebartable">
	<tr>
		<td>
			<pre class="programlisting">
// Source encoding: UTF 8 with BOM (necessary for
// Visual C++).
#include &lt;assert.h&gt;     // assert
#include &lt;fstream&gt;      // std::ofstream
auto main() -&gt; int
{
  auto const filename = &quot;π.recipe&quot;;
  // A pie recipe. :-)
  std::ofstream f( filename );
  assert( &quot;File creation&quot; &amp;&amp; !!f );
}
			</pre>
		</td>
	</tr>
	<tr>
		<td class="title">Listing 1</td>
	</tr>
</table>

<p>Compiling with the MinGW g++ 4.7.2 compiler (a Windows build of the GNU toolchain’s C++ compiler), running the program and checking the result gives the output shown in Figure 2.</p>

<table class="sidebartable">
	<tr>
		<td>
			<pre class="programlisting">
&gt; del a.exe *.recipe 2&gt;nul &amp;^
More? g++ cplusplus_stdlib_version.cpp &amp;&amp;^
More? a.exe &amp;&amp; dir /b *.recipe
Ï€.recipe
			</pre>
		</td>
	</tr>
	<tr>
		<td class="title">Figure 2</td>
	</tr>
</table>

<p>This produced an <em>erroneous</em> result, a filename different from the specified one, namely <span class="filename">Ï€.recipe</span> instead of the specified <span class="filename">π.recipe</span>.</p>

<p>In some cases, but mostly with Microsoft’s Visual C++, this happens because a UTF-8-encoded source is misinterpreted as a Windows ANSI-encoded source (so it’s worth checking that the source encoding is correct!), but the reason above is that the MinGW g++ compiler and its standard library implementation have <em>different opinions</em> about what the C++ <strong>execution character set</strong> is or should be.</p>

<p>The g++ compiler defaults to UTF-8, which is the <em>de facto</em> standard narrow string encoding in *nix, while its standard library implementation, presumably delegating to Microsoft’s runtime library, defaults to <strong>Windows ANSI</strong>, which is the <em>de facto</em> standard narrow string encoding in Windows programming.</p>
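<p>The mismatch can be reproduced on any platform by taking the UTF-8 bytes of the filename and decoding them as codepage 1252, which is roughly what happens when a UTF-8 narrow string is handed to a library that assumes Windows ANSI Western. The following is a sketch only: the function names are hypothetical, and the five byte values that are undefined in codepage 1252 are simply passed through unchanged, which is an assumption made for the sketch.</p>

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: map one codepage 1252 byte to a Unicode code point.
// Bytes 0x00..0x7F and 0xA0..0xFF coincide with Latin 1 and hence with Unicode.
inline std::uint32_t cp1252_to_unicode( unsigned char byte )
{
    // Rows 80H and 90H of Figure 1. Undefined slots (81, 8D, 8F, 90, 9D)
    // are passed through as-is; that choice is an assumption of this sketch.
    static std::uint16_t const high[32] = {
        0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
        0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
        0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
        0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178
    };
    if( byte >= 0x80 && byte <= 0x9F ) { return high[byte - 0x80]; }
    return byte;
}

// Hypothetical helper: interpret every byte of a narrow string as codepage 1252.
inline std::vector<std::uint32_t> decode_as_cp1252( std::string const& bytes )
{
    std::vector<std::uint32_t> result;
    for( char const ch : bytes ) {
        result.push_back( cp1252_to_unicode( static_cast<unsigned char>( ch ) ) );
    }
    return result;
}
```

<p>Decoding the UTF-8 bytes <code>CF 80</code> of π this way yields U+00CF (Ï) followed by U+20AC (€), which matches the filename shown in Figure 2.</p>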

<p>Adjusting the g++ compiler’s execution character set to match its standard library’s expectations will in general not help in obtaining a correct result, since most variants of Windows ANSI lack the lowercase Greek π character. But it does convert the silent erroneous result behaviour to a work-saving up-front <em>compilation error</em>. So, when using g++ in Windows, to avoid possible silent erroneous results do add the <code>-fexec-charset=cp</code><em>YourANSICodepageNumber</em> option, e.g. as shown in Figure 3.</p>

<table class="sidebartable">
	<tr>
		<td>
			<pre class="programlisting">
&gt; del a.exe *.recipe 2&gt;nul &amp;^
More? g++ cplusplus_stdlib_version.cpp -fexec-charset=cp1252 &amp;&amp;^
More? a.exe &amp;&amp; dir /b *.recipe
cplusplus_stdlib_version.cpp: In function 'int main()':
cplusplus_stdlib_version.cpp:7:27: error: converting to execution character set: Illegal byte sequence    &lt;-- Nice up-front compilation error.
cplusplus_stdlib_version.cpp:7:27: error: unable to deduce 'const auto' from '&lt;expression error&gt;'
			</pre>
		</td>
	</tr>
	<tr>
		<td class="title">Figure 3</td>
	</tr>
</table>

<p>So, how about using Windows’ own main compiler, Microsoft’s Visual C++, for this code? (See Figure 4.)</p>

<table class="sidebartable">
	<tr>
		<td>
			<pre class="programlisting">
&gt; del b.exe *.recipe 2&gt;nul &amp;^
More? cl cplusplus_stdlib_version.cpp /Fe&quot;b.exe&quot; &amp;&amp;^
More? b.exe &amp;&amp; dir /b *.recipe
cplusplus_stdlib_version.cpp
cplusplus_stdlib_version.cpp(7) : warning C4566: character represented by universal character name '\u03C0' cannot be represented in the current code page (1252)
Assertion failed: &quot;File creation&quot; &amp;&amp; !!f, file cplusplus_stdlib_version.cpp, line 10
			</pre>
		</td>
	</tr>
	<tr>
		<td class="title">Figure 4</td>
	</tr>
</table>

<p>Here Visual C++ unfortunately accepted the source code, but happily the program then produced a <em>runtime error</em>. This is far better than g++’s default silent erroneous result, but it’s rather ungood news for the portability of pure standard C++ source code as of 2013. Currently, the two main free C++ compilers for Windows are Visual C++ (Microsoft) and g++ (GNU), and as exemplified above neither of them supports UTF-8 string constants for e.g. filenames.</p>

<p>It’s not that Windows can’t handle the <span class="filename">π.recipe</span> filename. Unicode filenames are supported by the Windows API, they’re supported by Windows-specific library extensions such as <code>_wfopen</code>, and there’s no problem creating or accessing such a file in e.g. Java or C# or Python 3. The problem is that such files can’t be accessed using only portable, pure <strong>standard C++</strong> source code, and also that even if C++ had the wide string support that is <em>de facto</em> standard in Windows, using it directly for portable code would be needlessly inefficient in *nix; and the part of that problem that I address here is the support for string literals.</p>

<h2>How Boost filesystem doesn’t help</h2>

<p>After the C++ standard library the next place to look for general functionality is usually the Boost library. For our example code the relevant sub-library is the Boost filesystem library. The Boost filesystem library, but apparently sans the <code>boost::filesystem::ofstream</code> class<a href="#FN10"><sup>10</sup></a> that’s used below, is slated for inclusion in <em>C++ Technical Report 2</em> (TR2)<a href="#FN11"><sup>11</sup></a>, which effectively means also in the next C++ standard.</p>

<p>However, the Boost filesystem library does not offer or visibly use<a href="#FN12"><sup>12</sup></a> portable system dependent strings, and so for portable code, with the Boost filesystem library a filename such as <span class="filename">&quot;π.recipe&quot;</span> has to be specified as a wide string, like <span class="filename">L&quot;π.recipe&quot;</span>.</p>

<p>Since it’s impractical to deal with two or more different string formats, one would then presumably standardize on using <code>wchar_t</code> based strings for all portable strings. This then incurs a <em>string conversion cost</em> in *nix, in the worst case for most every API call involving strings, which is counter to the general C++ principle of not paying for what you don’t use. This cost (and others) is meant to buy a correct result, so let’s check whether the Boost filesystem library actually does produce a correct result (Listing 2).</p>

<table class="sidebartable">
	<tr>
		<td>
			<pre class="programlisting">
// Source encoding: UTF 8 with BOM 
// (necessary for Visual C++).
#include &lt;assert.h&gt;                 // assert
#include &lt;boost/filesystem/fstream.hpp&gt;
 // boost::filesystem::ofstream
namespace bfs = boost::filesystem;
auto main()
  -&gt; int
{
  auto const filename = L&quot;π.recipe&quot;;
  // A pie recipe. :-)
  bfs::ofstream f( filename );
  assert( &quot;File creation&quot; &amp;&amp; !!f );
}
			</pre>
		</td>
	</tr>
	<tr>
		<td class="title">Listing 2</td>
	</tr>
</table>

<p>Compiling the program with the Visual C++ 11.0 compiler, using boost 1_54 filesystem and system libraries, and running gives the result shown in Figure 5.</p>

<table class="sidebartable">
	<tr>
		<td>
			<pre class="programlisting">
&gt; del b.exe *.recipe 2&gt;nul &amp;^
More? cl cplusplus_boost_version.cpp /MD /Fe&quot;b.exe&quot; /I&quot;%boost_pincludes%&quot; /link %msvc_link_bfs% &amp;&amp;^
More? b.exe &amp;&amp; dir /b *.recipe
cplusplus_boost_version.cpp
π.recipe
			</pre>
		</td>
	</tr>
	<tr>
		<td class="title">Figure 5</td>
	</tr>
</table>

<p>Well, that worked nicely! At an encoding conversion cost for *nix, and at the general cost of using Boost. But how about building with MinGW g++ 4.7.2? (See Figure 6.)</p>

<table class="sidebartable">
	<tr>
		<td>
			<pre class="programlisting">
&gt; del a.exe *.recipe 2&gt;nul &amp;^
More? g++ cplusplus_boost_version.cpp -fexec-charset=cp1252 %gnuc_using_bfs% &amp;&amp;^
More? a.exe &amp;&amp; dir /b *.recipe
Assertion failed: &quot;File creation&quot; &amp;&amp; !!f, file cplusplus_boost_version.cpp, line 12

This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.
			</pre>
		</td>
	</tr>
	<tr>
		<td class="title">Figure 6</td>
	</tr>
</table>

<p>The Boost filesystem library takes advantage of a Visual C++ extension to the standard library, namely a <code>wchar_t</code> based <code>ofstream</code> constructor, when the library is built with Visual C++. The g++ compiler’s standard library implementation has a cleaner extension, a <code>std::streambuf</code> subclass that can be initialized from a C <code>FILE*</code>. And a possibly more efficient workaround for the standard library’s lack of Unicode filename support, which works with any compiler, is Windows’ so called ‘short’ or ‘DOS’ or ‘8+3’ filenames, which <em>were</em> used in Boost filesystem version 2.<a href="#FN13"><sup>13</sup></a> But the current Boost filesystem library simply doesn’t support Windows C++ compilers in general. For Windows it now only provides full functionality, the ability to portably access files with names such as <span class="filename">π.recipe</span>, when it’s used with Visual C++ or a compiler with the same standard library extensions…</p>

<p>If the Boost filesystem library is just made part of the C++ standard we’ll then have an absurdity: a part of the standard library making essential use of wide string based constructors, and thus effectively requiring them<a href="#FN14"><sup>14</sup></a> of all Windows standard library implementations, without having them standardized and available to all.</p>

<p>Summing up, since using Boost filesystem as a portability layer requires using wide strings it incurs an efficiency/complexity cost on *nix, a cost that in a great many cases buys you nothing. And worse, the current version doesn’t even produce correct results with g++ in Windows, thus not providing the goods that the cost was meant to cover. Thus, as of this writing (July 2013) Boost filesystem is not a solution.</p>

<h2>Strongly typed system dependent literals</h2>

<p>In the same way that C++ integer types such as <code>int</code> are portable because their sizes depend on the system, one can define a character encoding value type<a href="#FN15"><sup>15</sup></a> that’s portable because its size and assumed encoding depend usefully on the system. I.e., a system dependent character encoding value type, which is portable precisely because it’s system dependent – just as with the <code>int</code> type etc. It can look like Listing 3.</p>

<table class="sidebartable">
	<tr>
		<td>
			<pre class="programlisting">
#ifdef _WIN32
   namespace cppx{ typedef wchar_t Raw_syschar; }
   // Implies UTF-16 encoding.
#  define CPPX_WITH_SYSCHAR_PREFIX( lit ) L##lit
#else 
   namespace cppx{ typedef char Raw_syschar; }
   // Implies UTF-8 encoding.
#  define CPPX_WITH_SYSCHAR_PREFIX( lit ) lit
#endif
			</pre>
		</td>
	</tr>
	<tr>
		<td class="title">Listing 3</td>
	</tr>
</table>

<p>The <code>_WIN32</code> macro, a <em>de facto</em> standard in Windows C and C++ programming, is defined for both 32-bit and 64-bit Windows programming. There is one problem with the <code>Raw_syschar</code> type, though, namely that it’s just a synonym for another type; it isn’t <strong>distinct</strong>. For example, one cannot define a distinct <code>std::basic_string</code> specialization for it. It’s practically possible<a href="#FN16"><sup>16</sup></a> to define a distinct <code>Raw_syschar</code> type as a class, but in order to be able to put that inside a constructor-free <code>union</code> – as can happen with the short string optimization<a href="#FN17"><sup>17</sup></a>, where the <code>union</code> then occurs in the <code>std::basic_string</code> implementation – it would need to be without any user defined constructor. That means that it would need to expose a public data member, which is somewhat unclean, and different from use of basic types like <code>char</code> and <code>wchar_t</code>.</p>

<p>Happily with C++11, and with Visual C++ for some time before that as a language extension, one can define an <code>enum</code> type with a specified underlying representation (this and all the following definitional code is in namespace <code>cppx</code>):</p>

<pre class="programlisting">
  enum Syschar : Raw_syschar {};
</pre>
  
<p>This produces a type with very much the desired properties<a href="#FN18"><sup>18</sup></a> of a character encoding value type, namely, it’s a distinct type that supports all the built-in comparison operators, and it provides an implicit conversion to integer.</p>
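<p>These properties can be checked directly at compile time. The sketch below assumes the *nix branch of Listing 3 (a <code>char</code>-based <code>Raw_syschar</code>) and places everything in a hypothetical namespace <code>sketch</code> rather than <code>cppx</code>; the <code>which</code> overloads exist only to show that the two types take part in overload resolution as different types.</p>

```cpp
#include <type_traits>

namespace sketch {
    // Assumption: the *nix branch of Listing 3; on Windows this would be wchar_t.
    typedef char Raw_syschar;

    // Same definition as in the article, just in a different namespace.
    enum Syschar : Raw_syschar {};

    // Distinct type, yet same size and representation as the raw character type.
    static_assert( !std::is_same< Syschar, Raw_syschar >::value,
                   "Syschar must be a distinct type" );
    static_assert( sizeof( Syschar ) == sizeof( Raw_syschar ),
                   "Syschar must have the size of the raw type" );
    static_assert( std::is_same< std::underlying_type< Syschar >::type,
                                 Raw_syschar >::value,
                   "Syschar is represented as Raw_syschar" );

    // Hypothetical overloads: being distinct means these do not collide.
    inline int which( Raw_syschar ) { return 1; }
    inline int which( Syschar )     { return 2; }
}  // namespace sketch
```

<p>A <code>Syschar</code> value still converts implicitly to <code>int</code> and supports <code>&lt;</code>, <code>==</code> and the other built-in comparisons, but a raw <code>char</code> does not convert implicitly to <code>Syschar</code>, which is what keeps the encodings from being mixed by accident.</p>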

<p>And just by defining a <code>std::char_traits</code> specialization this type supports a distinct specialization of <code>std::basic_string</code>, if you should want that. Such a <code>std::char_traits</code> specialization is just a collection of static member functions that forward to the corresponding functions for the raw character type. However, such forwarding functions require general conversion between raw and typed characters and character strings - e.g. the following three <strong>typed</strong> functions for converting to strongly typed form, and corresponding <strong>raw</strong> functions the other way (Listing 4).<a href="#FN19"><sup>19</sup></a></p>

<table class="sidebartable">
	<tr>
		<td>
			<pre class="programlisting">
auto typed( Raw_syschar const c )
  CPPX_NOEXCEPT
  -&gt; Syschar
{return static_cast&lt; Syschar const &gt;( c );}
auto typed( Raw_syschar* const s )
  CPPX_NOEXCEPT
  -&gt; Syschar*
{return reinterpret_cast&lt; Syschar*&gt;( s );}
auto typed( Raw_syschar const* const s )
  CPPX_NOEXCEPT
  -&gt; Syschar const*
{return reinterpret_cast&lt; Syschar const* &gt;( s );}
			</pre>
		</td>
	</tr>
	<tr>
		<td class="title">Listing 4</td>
	</tr>
</table>

<p>This looks trivial, yes?</p>

<p>Unfortunately, in order to later be able to construct a class type string very efficiently from a literal, it’s very desirable to also have a function template like Listing 5, but this function template can then never be implicitly selected. The reason is that for an array type actual argument of any given size the corresponding specialization would not offer a better argument conversion than the pointer argument function. With the specialization the call would therefore be ambiguous. And then the C++11 standard decrees in its §13.3.3/1 fifth dash that F1 is a better function than F2 if “<em>F1 is a non-template function and F2 is a function template specialization</em>”, which for the above functions means that the pointer argument function will always win.</p>

<table class="sidebartable">
	<tr>
		<td>
			<pre class="programlisting">
template&lt; Size n &gt;
auto typed( Raw_syschar const (&amp;a)[n] )
  CPPX_NOEXCEPT
  -&gt; Syschar const (&amp;)[n]
{return reinterpret_cast&lt; Syschar 
   const (&amp;)[n] &gt;( a ); }
			</pre>
		</td>
	</tr>
	<tr>
		<td class="title">Listing 5</td>
	</tr>
</table>
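<p>The overload resolution rule is easy to observe in isolation. The following minimal sketch uses plain <code>char</code> and a hypothetical function name <code>chosen</code> instead of the article's <code>typed</code>; each overload just reports which one was selected.</p>

```cpp
#include <cstddef>
#include <string>

// Non-template overload taking a pointer.
inline std::string chosen( char const* )
{
    return "pointer overload";
}

// Function template taking a reference to an array of any size.
template< std::size_t n >
std::string chosen( char const (&)[n] )
{
    return "array overload";
}
```

<p>A call such as <code>chosen( "abc" )</code> selects the pointer overload, because the array-to-pointer conversion ranks as an exact match and the non-template function then wins the tie. The array version is only reached when it is named explicitly, e.g. <code>chosen&lt;4&gt;( "abc" )</code>.</p>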

<p>My chosen fix is to route all calls to functions in a given set (e.g. <code>typed()</code> calls) via a single function template. The template just checks the actual argument type and dispatches the real work. To enable the dispatch call’s function selection each of the typed overloads, and also each of the raw overloads, is outfitted with a defaulted nameless <strong>dummy argument</strong> that identifies the general kind of actual argument (see Listing 6).</p>

<table class="sidebartable">
	<tr>
		<td>
			<pre class="programlisting">
namespace detail {
  ...
  inline
  auto typed( Raw_syschar const* const&amp; s,
              Pointer_kind = Pointer_kind() )
    CPPX_NOEXCEPT
    -&gt; Syschar const* const&amp;
  { return
    reinterpret_cast&lt; Syschar const* const&amp; &gt;
    ( s ); }
  template&lt; Size n &gt;
  auto typed( Raw_syschar const (&amp;a)[n],
              Array_kind = Array_kind() )
    CPPX_NOEXCEPT
    -&gt; Syschar const (&amp;)[n]
  { return reinterpret_cast
     &lt; Syschar const (&amp;)[n] &gt;( a ); }
}  // namespace detail
			</pre>
		</td>
	</tr>
	<tr>
		<td class="title">Listing 6</td>
	</tr>
</table>

<p>All <code>typed</code> calls go through the function template for this set of functions (Listing 7), where <code>Type_kind_</code> is part of the small machinery that checks the argument type (see Listing 8).</p>

<table class="sidebartable">
	<tr>
		<td>
			<pre class="programlisting">
template&lt; class Arg &gt;
auto typed( Arg const&amp; arg )
  CPPX_NOEXCEPT
  -&gt; decltype( detail::typed( arg,
    typename Type_kind_&lt;Arg&gt;::T() ) )
{ return detail::typed( arg,
    typename Type_kind_&lt;Arg&gt;::T() ); }
			</pre>
		</td>
	</tr>
	<tr>
		<td class="title">Listing 7</td>
	</tr>
</table>

<table class="sidebartable">
	<tr>
		<td>
			<pre class="programlisting">
#pragma once
// Copyright (c) 2013 Alf P. Steinbach
// Mostly this is to enable a workaround for
// ordinary overload resolution.
#include &lt;rfc/cppx/core/Size.h&gt;  // cppx::Size
namespace cppx {
  enum Value_kind {};
  enum Pointer_kind {};
  enum Array_kind {};
  template&lt; class Type &gt;
  struct Type_kind_ { typedef Value_kind T; };
  template&lt; class Type &gt;
  struct Type_kind_&lt;Type*&gt; {
     typedef Pointer_kind T; };
  template&lt; class Type &gt;
  struct Type_kind_&lt;Type* const&gt; {
     typedef Pointer_kind T; };
  template&lt; class Type, Size n &gt;
  struct Type_kind_&lt; Type[n] &gt; {
     typedef Array_kind T; };
}  // namespace cppx
			</pre>
		</td>
	</tr>
	<tr>
		<td class="title">Listing 8</td>
	</tr>
</table>
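<p>Put together, the dispatch scheme of Listings 6 to 8 can be tried out in a self-contained form. This is a simplified sketch, not the article's code: it lives in a hypothetical namespace <code>sketch</code>, uses plain <code>char</code>, uses empty structs as kind tags, and replaces the <code>typed</code> conversions with a hypothetical <code>describe</code> function that merely reports which overload was reached.</p>

```cpp
#include <cstddef>
#include <string>

namespace sketch {
    // Kind tags, playing the role of the kind types in Listing 8.
    struct Value_kind {};
    struct Pointer_kind {};
    struct Array_kind {};

    // Classifies the deduced argument type.
    template< class Type >
    struct Type_kind_ { typedef Value_kind T; };
    template< class Type >
    struct Type_kind_< Type* > { typedef Pointer_kind T; };
    template< class Type, std::size_t n >
    struct Type_kind_< Type[n] > { typedef Array_kind T; };

    namespace detail {
        // The tag argument makes each overload viable for one kind only.
        inline std::string describe( char const* const&, Pointer_kind )
        {
            return "pointer";
        }

        template< std::size_t n >
        std::string describe( char const (&)[n], Array_kind )
        {
            return "array of " + std::to_string( n );
        }
    }  // namespace detail

    // Single entry point: deduces the argument kind and dispatches on it.
    template< class Arg >
    std::string describe( Arg const& arg )
    {
        return detail::describe( arg, typename Type_kind_< Arg >::T() );
    }
}  // namespace sketch
```

<p>With this in place a string literal reaches the array overload, so its length is available at compile time, while a pointer variable reaches the pointer overload. The tag argument removes the ambiguity that made the non-template function always win.</p>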

<p>Listing 9 is the file creation program again, but now using <code>Syschar</code> directly (only the machinery shown so far), producing a correct result. The just-for-this-example ad hoc header <code>x/ofstream.h</code> defines a subclass of <code>std::ofstream</code> called <code>x::ofstream</code> that provides a <code>Syschar</code>-based constructor by employing compiler-specific functionality. The necessity of compiler-specific or at least system-specific code for such basic functionality indicates to me that this area of functionality belongs in the standard.</p>

<table class="sidebartable">
	<tr>
		<td>
			<pre class="programlisting">
// Source encoding: UTF 8 with BOM (necessary
// for Visual C++).
#include &quot;x/ofstream.h&quot;     // x::ofstream
#include &lt;assert.h&gt;         // assert

auto main() -&gt; int
{
  using cppx::typed;
  // A pie recipe. :-)
  auto const filename = typed
     ( CPPX_WITH_SYSCHAR_PREFIX( &quot;π.recipe&quot; ) );
  x::ofstream f( filename );
  assert( &quot;File creation&quot; &amp;&amp; !!f );
}
			</pre>
		</td>
	</tr>
	<tr>
		<td class="title">Listing 9</td>
	</tr>
</table>

<p>But as the declaration of <code>filename</code> in Listing 9 shows, direct use of the conversion functionality defined so far yields <em>rather verbose</em> specifications of literal strings…</p>

<p>To support more concise usage expressions I therefore define two further macros, <code>CPPX_U</code> to express a typed literal and <code>CPPX_RAW_U</code> to express an untyped one (Listing 10).</p>

<table class="sidebartable">
	<tr>
		<td>
			<pre class="programlisting">
#define CPPX_AS_SYSCHAR( lit ) \
 ::cppx::typed( CPPX_WITH_SYSCHAR_PREFIX( lit ) )

#define CPPX_U      CPPX_AS_SYSCHAR
#define CPPX_RAW_U  CPPX_WITH_SYSCHAR_PREFIX
			</pre>
		</td>
	</tr>
	<tr>
		<td class="title">Listing 10</td>
	</tr>
</table>

<p>And with <code>CPPX_U</code> the file creation program looks, to my eyes, acceptable (Listing 11).</p>

<table class="sidebartable">
	<tr>
		<td>
			<pre class="programlisting">
// Source encoding: UTF 8 with BOM 
// (necessary for Visual C++).
#include &quot;x/ofstream.h&quot;     // x::ofstream
#include &lt;assert.h&gt;         // assert
auto main() -&gt; int
{
  auto const filename = CPPX_U( &quot;π.recipe&quot; );
  // A pie recipe. :-)
  x::ofstream f( filename );
  assert( &quot;File creation&quot; &amp;&amp; !!f );
}

			</pre>
		</td>
	</tr>
	<tr>
		<td class="title">Listing 11</td>
	</tr>
</table>

<p>When it’s compiled for Windows this program uses UTF-16 encoded <code>wchar_t</code> based strings, and when it’s compiled for *nix it uses UTF-8 encoded <code>char</code> based strings. Unlike the C++ standard library and unlike Boost filesystem, this ensures maximum efficiency for API calls, i.e. no runtime encoding conversion. Also unlike those two, given the necessary higher level functional support such as exemplified by <code>x::ofstream</code>, it provides access to all valid filenames on each system, and so lets students almost effortlessly write portable basic C++ programs that can handle Norwegian student names, etc.</p>
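<p>The per-system choice of literal type can be illustrated with a small self-contained sketch. The <code>DEMO_</code> and <code>demo::</code> names are hypothetical stand-ins, not the article’s <code>CPPX_</code> machinery, and the *nix branch assumes a compiler whose narrow execution character set is UTF-8 (the default for g++ and clang).</p>

```cpp
// Hypothetical illustration; the DEMO_ and demo:: names are not from the article.
#include <cstddef>       // std::size_t
#include <type_traits>   // std::is_same, std::remove_cv, std::remove_extent

#ifdef _WIN32
    typedef wchar_t Demo_raw_syschar;       // UTF-16 encoding values.
#   define DEMO_RAW_U( lit ) L##lit         // Pastes an L prefix onto the literal.
#else
    typedef char Demo_raw_syschar;          // UTF-8 encoding values (assumed).
#   define DEMO_RAW_U( lit ) lit            // The literal is used as-is.
#endif

namespace demo {
    // Number of encoding values in a literal, not counting the terminating zero.
    template< class Char, std::size_t n >
    constexpr auto length( Char const (&)[n] )
        -> std::size_t
    { return n - 1; }

    // The literal's element type is fixed at compile time, so passing it to
    // a system API of the same character type needs no runtime conversion.
    static_assert(
        std::is_same<
            std::remove_cv< std::remove_extent<
                std::remove_reference< decltype( DEMO_RAW_U( "" ) ) >::type
                >::type >::type,
            Demo_raw_syschar
            >::value,
        "The literal has the system's raw character type"
        );
}  // namespace demo
```

<p>With these definitions <code>demo::length( DEMO_RAW_U( "\u03C0.recipe" ) )</code> counts encoding values rather than characters: the Greek letter pi takes two values in UTF-8 but only one in UTF-16, so the same source line yields a different array length on each system.</p>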

<h2>Summary and final considerations</h2>

<p>Standard C++11 does not provide the means to access Windows files in general, because the filenames can’t be expressed as Windows ANSI encoded <code>char</code> based strings. The Boost filesystem library, slated for inclusion in TR2, imposes an efficiency cost for portable code used in *nix by requiring portable strings to be <code>wchar_t</code> based. And in Windows the Boost filesystem library only supports general Unicode filenames when it’s used with the Visual C++ compiler.</p>

<p>The main idea for the library solution presented here is to use only the portable <code>CPPX_U</code> string notation in the portable code, and to have such strings reinterpreted as system specific <code>char</code> or <code>wchar_t</code> based strings for the system dependent implementation code, if any, and as necessary. By using a character encoding value type that’s defined differently depending on the system, plus a macro that adds strong typing and an <code>L</code> literal prefix as required for each system, the exact same source code can specify strongly typed string literals with UTF-8 encoding for *nix, and with UTF-16 encoding for Windows. This is maximally efficient for each system’s API function calls and favoured external text encoding, and makes it technically possible to access all valid filenames on each system, as shown.</p>

<p>To make this work most seamlessly the C++ source code should then be UTF-8 encoded with BOM, because that encoding is accepted and understood by default by both Visual C++<a href="#FN20"><sup>20</sup></a> and g++, and because support for this source encoding is a reasonable requirement for any C++ compiler that one might consider using.</p>

<p class="footnotes"></p>

<ol>

<li><a id="FN01"></a>I haven’t found any authoritative statements or data about *nix character encodings other than Markus Kuhn’s Unix Unicode FAQ maintaining that “UTF-8 is the way in which Unicode is used under Unix, Linux, and similar systems”. In Nov. 2011 I asked about it on Stack Exchange, but alas without a definitive answer. If you’re interested in various opinions and details then check out that question at: <a href="http://unix.stackexchange.com/questions/24529/most-common-encoding-for-strings-in-c-in-linux-and-unix">http://unix.stackexchange.com/questions/24529/most-common-encoding-for-strings-in-c-in-linux-and-unix</a>.</li>

<li><a id="FN02"></a>The main Windows C++ compiler, Visual C++, supports only Windows ANSI as a narrow C++ execution character set, and UTF-16 for wide string literals. Windows ANSI cannot portably encode international text and incurs conversion costs. UTF-16, in Windows called ‘Unicode’, is therefore used by the vast majority of projects, and is the default in Visual Studio projects.</li>

<li><a id="FN03"></a>At the time of writing, Visual C++ in version 11.0 does not yet support the C++11 <code>u8</code>, <code>u</code> and <code>U</code> prefixes.</li>

<li><a id="FN04"></a>As of Boost version 1.54, released during the writing of this article.</li>

<li><a id="FN05"></a>For standard C++ the <code>u8</code> prefix does produce a <code>char</code> based literal.</li>

<li><a id="FN06"></a>At the time known as the United States of America Standards Institute, USASI; the name was changed to the American National Standards Institute, ANSI, in 1969.</li>

<li><a id="FN07"></a>According to Wikipedia’s codepage article, at http://en.wikipedia.org/wiki/Code_page, DOS gained codepage support in version 3.3, in 1987, while the first version of Windows was released in 1985.</li>

<li><a id="FN08"></a>The term ‘ANSI Windows’ was used by one reviewer, who conflated it with ‘Windows ANSI’ (encoding) and ‘ANSI window’ (configuration). This term can appear to be used when ‘ANSI’ is used as a qualification. E.g. ‘ANSI Windows codepages’, meaning ‘ANSI (Windows codepages)’, the codepages that can be used as Windows ANSI, i.e., that can be returned by <code>GetACP</code>.</li>

<li><a id="FN09"></a>...to compilers that support UTF-8 source code, which all the relevant compilers do. More generally, <em>portable</em> for C++ means portable within the limits of the language implementation that one ports to. E.g., putting this to the point, the C++ standard does not specify the size of <code>bool</code>, so that frivolous use of <code>bool</code> type local variables conceivably could exceed the available memory, yet such code is portable. One reviewer has however argued that C++ only supports source code with the characters formally guaranteed to be supported, i.e. only pure ASCII source code with no “$” signs, portably.</li>

<li><a id="FN10"></a>Judging by the N3693 draft Technical Specification at http://isocpp.org/files/papers/N3693.html</li>

<li><a id="FN11"></a>Wikipedia lists the TR2 proposals at http://en.wikipedia.org/wiki/C++_Technical_Report_1#Technical_Report_2</li>

<li><a id="FN12"></a>Internally the <code>boost::filesystem::path</code> class uses a representation of international text where the public definition <code>value_type</code> corresponds to the ‘raw’ encoding value type discussed in this article, with UTF-8 for *nix and UTF-16 for Windows. Presumably with C++14 (if that should be the next C++ standard), this article’s <code>Raw_syschar</code> could be defined as <code>std::filesystem::path::value_type</code>.</li>

<li><a id="FN13"></a>I filed a ticket about its disappearance in 2011, #6065 available at https://svn.boost.org/trac/boost/ticket/6065</li>

<li><a id="FN14"></a>The N3693 draft Technical Specification contains this wording in its §8.4.6: “Implementations of the standard library for systems where <code>string_type</code> is <code>wstring</code>, such as Windows, are encouraged to provide an extension to existing standard library file stream constructors and open functions that adds overloads that accept <code>wstring</code>s for file names. Microsoft and Dinkumware already provide such an extension.”</li>

<li><a id="FN15"></a>The <code>wchar_t</code> type can be argued to be such a type, but it’s impractical for the purpose of portability.</li>

<li><a id="FN16"></a>While it’s not guaranteed by the C++ standard, as far as I know there’s no compiler that by default will yield <code>sizeof(T)</code> &gt; 1 when <code>T</code> is a POD class with just a single <code>char</code> data member.</li>

<li><a id="FN17"></a>The last time I checked, two or three years ago, it did happen with Visual C++’s <code>std::string</code>.</li>

<li><a id="FN18"></a>If names of e.g. control characters are desired then one can use an <code>enum class</code> in order to support easy name qualification, but for this article’s exposition <code>enum class</code> would not have a purpose.</li>

<li><a id="FN19"></a>Here <code>CPP_NOEXCEPT</code> is a macro that depending on the compiler is defined as C++11 <code>noexcept</code> (e.g. for g++ and clang) or C++03 <code>throw()</code> (for Visual C++ 11.0 and earlier).</li>

<li><a id="FN20"></a>As a practical matter, for UTF-8 encoded source code the Visual C++ compiler <strong>requires</strong> a Byte Order Mark (BOM) in order to correctly deduce the encoding. Some earlier versions of the g++ compiler didn’t support a BOM for UTF-8, but now it does, so that it’s not even necessary to do that minimal source code encoding conversion. The same source can be used exactly as-is for both systems.</li>

</ol>
</p>
</div>
</channel>
</rss>
