    <rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:admin="http://webns.net/mvcb/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:content="http://purl.org/rss/1.0/modules/content/">
     <channel>
        <title>ACCU  :: Portable Console I/O via iostreams</title>
        <link>https://members.accu.org/index.php/articles/2404</link>
        <description>Professionalism in Programming</description>
        <dc:language>en-us</dc:language> 
        <dc:creator>Administrator</dc:creator> 
        <admin:generatorAgent rdf:resource="http://www.xaraya.org" /> 
        <admin:errorReportsTo rdf:resource="mailto:webeditor@accu.org" />
       <sy:updatePeriod>hourly</sy:updatePeriod>
       <sy:updateFrequency>1</sy:updateFrequency>
       <docs>http://backend.userland.com/rss</docs>




<div class="xar-mod-head"><span class="xar-mod-title">Programming Topics + Overload Journal #140 - August 2017</span></div>





<div class="xar-norm xar-standard-box-padding">
   <h1><strong>Title:</strong>&nbsp;Portable Console I/O via iostreams</h1>
<p><strong>Author:</strong>&nbsp;Bob Schmidt</p>
<p>
<strong>Date:</strong> Fri, 04 August 2017 00:45:16 +01:00</p>
<p><strong>Summary:</strong>&nbsp;Portable streaming is challenging. Alf Steinbach describes how his library fixes problems with non-ASCII characters.</p>
<p><strong>Body:</strong>&nbsp;<p>My Boost-licensed <strong>stdlib</strong> header library [<a href="#[stdlib]">stdlib</a>] applies some crucial fixes to the C++ implementation’s standard library, and provides a (hopefully) complete set of wrapper headers that apply these fixes; some functionality used internally in the <strong>stdlib</strong> implementation; and a number of convenience headers for the standard library.</p>

<p>The most important fix, because it enables portability and reasonable functionality for beginners’ programs, is to console i/o via the <code>char</code>-based text iostreams (e.g. <code>cout</code>) in Windows. <strong>stdlib</strong> installs special buffers in the standard iostreams that are connected to the console, and these buffers provide a UTF-8 view of the console. That means that portable ordinary <code>char</code> and <code>std::string</code> based code can present e.g. Norwegian and Russian text in the console, via <code>cout</code>, and can input international text from the user, via <code>cin</code>.</p>

<p><strong>stdlib</strong> also provides a UTF-16 view of the console for <code>wchar_t</code> based i/o via the wide iostreams, such as <code>wcout</code>.</p>

<p>The UTF-16 view was functionality that essentially came for free, because it was base functionality needed for the UTF-8 view, and it means that in addition to supporting portable <code>char</code> based code <strong>stdlib</strong> also supports <code>wchar_t</code>-based pure Windows programs.</p>

<p>Here I discuss only this <strong>portable console i/o</strong> aspect of <strong>stdlib</strong> – the other <strong>stdlib</strong> stuff is also nice, but is not as significant.</p>

<h2>Goal: portable console i/o</h2>

<p>The main goal with <strong>stdlib</strong> was to enable simple textbook style console based exploratory C++ programs, like the example in Listing 1.</p>

<table class="sidebartable">
	<tr>
		<td><img src="/content/images/journals/ol140/Steinbach/Steinbach_01.jpg" /></td>
	</tr>
	<tr>
		<td class="title">Listing 1</td>
	</tr>
</table>

<p>A student should be able to type in his or her own non-English name into this program, and see it accurately presented back by the program, <em>also in Windows</em>. This goal is accomplished, modulo the Windows console windows’ restriction to the BMP<a href="#FN01"><sup>1</sup></a> part of Unicode.</p>

<p>Without a console i/o fix applied, Visual C++’s runtime library forwards the nullbytes that a Windows console window in UTF-8 mode (codepage 65001) produces for non-ASCII characters. That yields a <code>name</code> string with embedded nullbytes, which in the console window’s presentation leaves blank areas (see Figure 1).</p>

<table class="sidebartable">
	<tr>
		<td><img src="/content/images/journals/ol140/Steinbach/Steinbach_02.jpg" /></td>
	</tr>
	<tr>
		<td class="title">Figure 1</td>
	</tr>
</table>

<p>Using the Visual C++ 2017 compiler <code>cl</code> in Windows 10 and applying the <strong>stdlib</strong> i/o fix via the <code>/FI</code> option for a forced include gives the output in Figure 2.</p>

<table class="sidebartable">
	<tr>
		<td><img src="/content/images/journals/ol140/Steinbach/Steinbach_03.jpg" /></td>
	</tr>
	<tr>
		<td class="title">Figure 2</td>
	</tr>
</table>

<p>This correct result is independent of the console window’s active codepage, and is the same in the *nix world.</p>

<p>The <strong>stdlib</strong> i/o fix includes a convenience <code>#pragma</code> for Visual C++, setting the execution character set to UTF-8, for otherwise the execution character set would have had to be specified explicitly as UTF-8 in every compilation, like the <code>/utf-8</code> option in the first compiler invocation above. Visual C++ defaults to Windows ANSI encoding, which depends on the locale Windows is installed for. With g++ the execution character set default is already UTF-8.</p>
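<p>As a rough illustration, the sketch below shows one Visual C++ pragma that requests UTF-8 for narrow literals, plus a compile time check of the result. Whether <strong>stdlib</strong> uses exactly this pragma is an assumption; the names <code>oslash</code> and <code>execution_charset_is_utf8</code> are made up for the example.</p>

```cpp
// Visual C++ only: request UTF-8 encoding of narrow literals in this
// translation unit. That stdlib does it exactly this way is an assumption.
#ifdef _MSC_VER
#   pragma execution_character_set("utf-8")
#endif

// U+00F8 (o with stroke) written as a universal character name, so that the
// check does not depend on how the source file itself is encoded.
constexpr char oslash[] = "\u00F8";

// True if narrow literals are UTF-8 encoded: U+00F8 is then the two bytes
// 0xC3 0xB8, followed by the terminating zero.
constexpr bool execution_charset_is_utf8()
{
    return sizeof(oslash) == 3
        && static_cast<unsigned char>(oslash[0]) == 0xC3
        && static_cast<unsigned char>(oslash[1]) == 0xB8;
}
```

<p>A header could then refuse to compile under a wrong setting with <code>static_assert(execution_charset_is_utf8(), "UTF-8 execution character set required")</code>.</p>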

<h2>The technical problem(s)</h2>

<p class="quote">I hate to hear ‘Less is more.’ It’s a crock of crap.<br />
~ R. Lee Ermey, American soldier and movie star of <em>Full Metal Jacket</em> [<a href="#[Ermey]">Ermey</a>]</p>

<p>The C and C++ standard libraries’ unified view of console, pipe and file i/o as minimalist streams of bytes works fine in the *nix world where C and C++ originated. But Windows is based on different ideas, ideas of richer standard functionality – much richer standard functionality. And so, in Windows the limited byte streams are second or third class citizens, not the primary way to interact with consoles: the streams are evidently there as backward compatibility support for archaic pre-Unicode programs, because UTF-8 console input Just Doesn’t Work™ for non-ASCII characters.</p>

<p>So, what happens if you tell a Windows console window to use UTF-8 encoding, by setting its active codepage to 65001?</p>

<p>As of Windows 10 byte stream output appears to work, but, down at the Windows API level, byte stream input of non-ASCII characters produces just nullbytes, as illustrated by a program that directly uses Windows’ <code>ReadFile</code> and <code>WriteFile</code> functions (see Figure 3).</p>

<table class="sidebartable">
	<tr>
		<td><img src="/content/images/journals/ol140/Steinbach/Steinbach_04.jpg" /></td>
	</tr>
	<tr>
		<td class="title">Figure 3</td>
	</tr>
</table>

<p>Additionally, Visual C++’s <code>setlocale</code> in Windows [<a href="#[Microsoft-a]">Microsoft-a</a>] explicitly does not support UTF-8. A possible reason is the C standard’s requirement that a <code>wchar_t</code> “<em>can represent distinct codes for all members of the largest extended character set specified among the supported locales</em>” [<a href="#[C99]">C99</a>]. Windows’ <code>wchar_t</code> type, dating from the early Unicode adoption, is just 16 bits, which with modern 21-bit Unicode is not enough for all members of a UTF-8 locale.</p>
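<p>To make the 16-bit limitation concrete: a code point outside the BMP does not fit in one 16-bit unit, so UTF-16 represents it as a surrogate pair. The following is a minimal sketch of that standard arithmetic; the function name <code>surrogate_pair</code> is an assumption for the example, and the argument is assumed to be in the range U+10000 through U+10FFFF.</p>

```cpp
#include <cstdint>
#include <utility>

// Splits a code point beyond the BMP into a UTF-16 (high, low) surrogate pair.
// Precondition (not checked): 0x10000 <= code_point <= 0x10FFFF.
std::pair<std::uint16_t, std::uint16_t> surrogate_pair(std::uint32_t code_point)
{
    std::uint32_t const offset = code_point - 0x10000;      // At most 20 bits.
    return {
        static_cast<std::uint16_t>(0xD800 + (offset >> 10)),    // Upper 10 bits.
        static_cast<std::uint16_t>(0xDC00 + (offset & 0x3FF))   // Lower 10 bits.
    };
}
```

<p>For example, U+1F600 becomes the two units 0xD83D and 0xDE00, so a single 16-bit <code>wchar_t</code> value cannot carry it.</p>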

<p>And in addition to the limited Windows support for UTF-8 in consoles, the C and C++ standard libraries fail to support UTF-8 text handling. There is no functionality for iterating over code points (which can be of a variable number of bytes); the functionality for <code>char</code> classification, such as the C library’s <code>isupper</code>, only works for single bytes, i.e. when the UTF-8 character is in the ASCII subset; the C++ library’s <code>std::ctype::widen</code>, which can deal with a string of encoding units, is rendered impotent for portable code by the fact that there’s no UTF-8 locale in Windows, so there’s no way to tell it that those bytes are UTF-8 encoded text; and so on, and on. AFAIK there’s no solution that addresses all the issues.</p>
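<p>As an illustration of what has to be hand-rolled, here is a minimal sketch of counting code points in a UTF-8 string by skipping continuation bytes. It assumes valid UTF-8 and does no error checking; the function names are made up for the example.</p>

```cpp
#include <cstddef>
#include <string>

// A UTF-8 continuation byte has the bit pattern 10xxxxxx.
inline bool is_continuation_byte(char ch)
{
    return (static_cast<unsigned char>(ch) & 0xC0) == 0x80;
}

// Counts code points in text that is assumed to be valid UTF-8: every byte
// that is not a continuation byte starts a new code point.
std::size_t code_point_count(const std::string& utf8)
{
    std::size_t n = 0;
    for (char const ch : utf8) {
        if (!is_continuation_byte(ch)) { ++n; }
    }
    return n;
}
```

<p>For the Norwegian word for blueberry, 8 bytes in UTF-8, this yields 6, whereas <code>std::string::size</code> reports the byte count.</p>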

<p>However, the lack of C++ standard library support was not a showstopper for the *nix world’s transition to UTF-8. In the late 1990s and early 2000s one simply let existing tools treat UTF-8 as extended ASCII text with occasional pass-them-right-through-please hey just ignore them high value bytes. Today, as of 2017, the *nix world appears to be all UTF-8 for text files, so that approach worked, and hence it can presumably also work for Windows.</p>
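<p>A small sketch of that pass-through style of processing: an uppercasing function that only touches ASCII letters and leaves every byte of a multi-byte UTF-8 sequence alone. The function name is an assumption for the example.</p>

```cpp
#include <string>

// Uppercases ASCII letters only. Bytes of multi-byte UTF-8 sequences all have
// the high bit set, so they never fall in the range 'a' through 'z' and are
// passed through unchanged.
std::string ascii_uppercased(std::string s)
{
    for (char& ch : s) {
        if ('a' <= ch && ch <= 'z') {
            ch = static_cast<char>(ch - 'a' + 'A');
        }
    }
    return s;
}
```

<p>The result stays valid UTF-8, but non-ASCII letters are not uppercased; that is the price of treating UTF-8 as extended ASCII.</p>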

<h2>Possible solutions</h2>

<p>The missing functionality for text handling is offered by various 3rd party libraries, including IBM’s open source ICU library [<a href="#[ICU]">ICU</a>], and Boost Locale, which is a <code>char</code>-based wrapper over ICU. The Boost Locale documentation notes that “<em>The default character encoding is assumed to be UTF-8 on Windows</em>” [<a href="#[Boost-a]">Boost-a</a>]. So evidently, an assumption of UTF-8 as the main text encoding on every platform, including in Windows, is not unheard of.</p>

<p>A mainly all UTF-8 approach for external text and for simple processing, with conversion to and from UTF-16 for e.g. use of ICU, seems to be where we’re heading, also for Windows programs.</p>

<p>Anyway, to work with international text in Windows consoles, especially for beginners, it’s practically necessary to</p>

<ul>
	<li>change the default font for Windows console windows<a href="#FN02"><sup>2</sup></a> to one that can display international characters, such as Lucida Console, or else use 3rd party console windows.</li>
</ul>

<p>With that display fix in place one basically has three options for portable C++ code:</p>

<ul>
	<li>use byte stream i/o with some fix applied in Windows, e.g. the standard library’s byte streams with a restricted character set from a national codepage, or with conversion to/from internal UTF-8 such as provided by <strong>stdlib</strong>;</li>

	<li>use wide stream i/o (note: the standard library’s wide stream i/o converts to and from external byte streams) with some platform-dependent fix applied, e.g. in Windows, using the standard library’s wide streams with Microsoft’s <code>_setmode</code> extension [<a href="#[Microsoft-b]">Microsoft-b</a>], or again using <strong>stdlib</strong>, and in the *nix world, with a suitable UTF-8 locale; or</li>

	<li>use an abstraction that transparently adapts the encoding to the system, selecting between byte and wide stream i/o within the implementation of that abstraction, with an encoding unit type suitably defined for each system.</li>
</ul>

<p>Some years ago, I saw adaptive encoding and i/o as a viable compromise between conflicting goals [<a href="#[Steinbach13]">Steinbach13</a>].</p>

<p>One main problem with that approach, however, is that it’s necessarily intrusive, e.g. requiring string literals wrapped in adaptive macro calls like <code>S(&quot;Hi&quot;)</code> and use of standard streams via adaptive references like <code>sys::out</code> for <code>std::cout</code>, so that</p>

<ul>
	<li>the approach can’t handle simple textbook example program code as-is, and hence</li>

	<li>existing code doesn’t automatically benefit.</li>
</ul>
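<p>To show why the adaptive approach is intrusive, here is a minimal sketch of what such an adaptation layer could look like. Only <code>S</code> and <code>sys::out</code> are names from the text; <code>sys::Char</code>, <code>sys::String</code> and <code>greeting</code> are assumptions, and a real implementation would need considerably more than this.</p>

```cpp
#include <iostream>
#include <string>

#ifdef _WIN32
    // Wide text in Windows: S("Hi") becomes L"Hi".
#   define S(literal) L##literal
    namespace sys {
        using Char = wchar_t;
        static std::wostream& out = std::wcout;
    }
#else
    // Narrow (UTF-8) text elsewhere: S("Hi") is just "Hi".
#   define S(literal) literal
    namespace sys {
        using Char = char;
        static std::ostream& out = std::cout;
    }
#endif

namespace sys {
    using String = std::basic_string<Char>;
}

// Client code has to be written in terms of the adaptive names throughout.
inline sys::String greeting() { return S("Hi"); }
```

<p>Client code then writes <code>sys::out &lt;&lt; greeting() &lt;&lt; S("!")</code> instead of using <code>std::cout</code> and plain literals, which is exactly what textbook code does not do.</p>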

<p>This is what <strong>stdlib</strong> addresses with its UTF-8 console i/o: it can handle textbook example program code as-is, and if existing code uses the C++ iostreams, then that code benefits automatically.</p>

<p>In contrast the <strong>nowide</strong> library [<a href="#[nowide]">nowide</a>], adopted in Boost [<a href="#[Boost-b]">Boost-b</a>] in June 2017, is an intrusive UTF-8 i/o approach, and thus, except that it handles ordinary narrow literals, it suffers from the drawbacks above.</p>

<p>The <strong>nowide</strong> web page refers to a 2011 blog posting of mine [<a href="#[Steinbach11]">Steinbach11</a>] about Unicode in Windows console windows, which, incidentally, is how I became aware of <strong>nowide</strong>, some time after I started work on <strong>stdlib</strong>. In that article, I argued for leveraging Microsoft’s <code>_setmode</code> extension, using wide text internally in the C++ program, and I referred to a 2008 blog posting by Microsoft’s Unicode guru Michael Kaplan, titled ‘Conventional wisdom is retarded, aka What the @#%&amp;* is <code>_O_U16TEXT</code>?’ [<a href="#[Kaplan08]">Kaplan08</a>]. Both <strong>stdlib</strong> and <strong>nowide</strong> now go in the opposite direction, using narrow text internally in C++.</p>

<h2>General comparison: adaptive versus stdlib versus nowide</h2>

<p>The C++ core language is involved in two areas: string literals and process command line arguments, namely the arguments of <code>main</code>. Happily, with the all UTF-8 approach of <strong>stdlib</strong> and <strong>nowide</strong>, and with modern compilers’ (especially now Visual C++’s) support for UTF-8 as the execution character set, one can just use ordinary narrow literals. Unfortunately, there seems to be no portable non-intrusive way to fix the encoding of the arguments of <code>main</code> in Windows, and so both libraries provide intrusive, portable means of obtaining UTF-8 encoded command line arguments.</p>

<p>Apart from that, the <strong>stdlib</strong> library is based on only providing transparent <em>fixes</em> to the standard library implementation, and a minimum of new functionality, while the adaptive approach and the <strong>nowide</strong> library are based on providing <em>alternatives</em> to the core language and standard library in certain areas.</p>

<p>With <strong>stdlib</strong>’s goal of providing as little new functionality as possible, checking which of <strong>stdlib</strong> and other libraries provides the most features would be mostly meaningless. But one can still compare how well the libraries achieve general goals or ideals. For the adaptive approach, the table below just lists what will be generally true of any reasonable implementation of that approach.</p>

<table>
	<tr>
		<th>Goal/ideal</th>
		<th>Adaptive</th>
		<th>stdlib</th>
		<th>nowide</th>
	</tr>

	<tr>
		<td colspan="4">
			<strong>General</strong>
		</td>
	</tr>

	<tr>
		<td>Working narrow Unicode console i/o</td>
		<td>n/a</td>
		<td>Success</td>
		<td>Partial</td>
	</tr>

	<tr>
		<td>Working wide Unicode console i/o</td>
		<td>n/a</td>
		<td>Success</td>
		<td>Failure</td>
	</tr>

	<tr>
		<td>That it fails gracefully for bad data</td>
		<td>-</td>
		<td>Success</td>
		<td>Failure</td>
	</tr>

	<tr>
		<td colspan="4"><strong>Support of coding</strong></td>
	</tr>

	<tr>
		<td>Idiomatic <code>char</code> based learner’s C++</td>
		<td>Failure</td>
		<td>Success</td>
		<td>Success</td>
	</tr>

	<tr>
		<td>No <span class="filename">&lt;windows.h&gt;</span> namespace pollution</td>
		<td>-</td>
		<td>Success</td>
		<td>Success</td>
	</tr>

	<tr>
		<td>Few or no explicit encoding conversions</td>
		<td>Partial</td>
		<td>Failure</td>
		<td>Failure</td>
	</tr>

	<tr>
		<td>Using textbook example code as-is</td>
		<td>Failure</td>
		<td>Mostly</td>
		<td>Failure</td>
	</tr>

	<tr>
		<td>Automatic benefit for existing code</td>
		<td>Failure</td>
		<td>Mostly</td>
		<td>Failure</td>
	</tr>

	<tr>
		<td colspan="4"><strong>Support of building &amp; other tool usage</strong></td>
	</tr>

	<tr>
		<td>No large 3rd party library dependency</td>
		<td>-</td>
		<td>Success</td>
		<td>Success</td>
	</tr>

	<tr>
		<td>Header only library</td>
		<td>-</td>
		<td>Success</td>
		<td>Failure</td>
	</tr>

	<tr>
		<td>Tools, e.g. string display in debuggers</td>
		<td>Success</td>
		<td>Failure</td>
		<td>Failure</td>
	</tr>

	<tr>
		<td>Clean build with common compilers</td>
		<td>-</td>
		<td>Success</td>
		<td>Failure</td>
	</tr>
</table>

<p>My ‘partial’ mark for <strong>nowide</strong>’s narrow console i/o is mainly due to its failure to remove carriage return characters from input in Windows (Listing 2). The result is in Figure 4.</p>

<table class="sidebartable">
	<tr>
		<td><img src="/content/images/journals/ol140/Steinbach/Steinbach_05.jpg" /></td>
	</tr>
	<tr>
		<td class="title">Listing 2</td>
	</tr>
</table>

<table class="sidebartable">
	<tr>
		<td><img src="/content/images/journals/ol140/Steinbach/Steinbach_06.jpg" /></td>
	</tr>
	<tr>
		<td class="title">Figure 4</td>
	</tr>
</table>

<p>This problem, plus a similar problem with Windows’ convention of using Ctrl Z as EOF marker, has probably already been fixed by the time you’re reading this. But I was perplexed to discover that the library bungled input, which is so fundamental to what it’s all about, after it had been approved for Boost. It’s really strange.</p>

<p>With Visual Studio’s debugger in Windows one can use the format specifier <code>,s8</code> on a watch of a raw C string to force UTF-8 interpretation of the bytes. However, with other presentations of narrow strings the VS debugger uses Windows ANSI, even when the program’s execution character set is UTF-8, with gobbledygook as the result. This is the main tool support failure of <strong>stdlib</strong> and <strong>nowide</strong>, and it’s one area where the adaptive approach would shine.</p>

<p>Hopefully, in the not distant future the Visual Studio debugger will gain some option to assume UTF-8, or maybe it will just pick up what the program’s execution character set is, not to mention encoding information for each literal, and use that.</p>

<p><strong>stdlib</strong>’s not quite 100% success in supporting textbook example code is due to the following constraints:</p>

<ul>
	<li>automatic conversion to/from internal UTF-8 for console i/o seems to not be portably possible for C <code>FILE*</code> i/o, and</li>

	<li>with both Visual C++ and MinGW g++ the arguments of <code>main</code> are (incorrectly) Windows ANSI-encoded even when the execution character set is UTF-8, and a transparent automatic fix appears to not be practically possible.</li>
</ul>

<h2>Command line arguments in stdlib versus nowide</h2>

<p>Both <strong>stdlib</strong> and <strong>nowide</strong> assume that <code>main</code> arguments on other platforms than Windows are UTF-8 encoded. In Windows, they both use the <code>GetCommandLineW</code> API function to obtain the original UTF-16 encoded command line passed to the process, and <code>CommandLineToArgvW</code> to parse it into individual arguments. <strong>stdlib</strong> uses this info to provide a separate set of UTF-8 encoded original command line arguments, while <strong>nowide</strong> uses the info to replace the <code>main</code> arguments with UTF-8 encoded originals.</p>
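<p>The general technique can be sketched as follows. This is neither library’s actual code: the function names are assumptions, error handling is minimal, and the Windows branch needs linking with shell32. Outside Windows the sketch simply copies the <code>main</code> arguments, on the stated assumption that they are already UTF-8.</p>

```cpp
#include <string>
#include <vector>

#ifdef _WIN32
#   include <cwchar>        // std::wcslen
#   include <stdexcept>
#   include <windows.h>
#   include <shellapi.h>    // CommandLineToArgvW; link with shell32.

// UTF-16 to UTF-8 via the Windows API.
inline std::string utf8_from_wide(const wchar_t* wide)
{
    int const wide_length = static_cast<int>(std::wcslen(wide));
    if (wide_length == 0) { return std::string(); }
    int const n = WideCharToMultiByte(
        CP_UTF8, 0, wide, wide_length, nullptr, 0, nullptr, nullptr);
    std::string result(n, '\0');
    WideCharToMultiByte(
        CP_UTF8, 0, wide, wide_length, &result[0], n, nullptr, nullptr);
    return result;
}
#endif

// Returns the command line arguments as UTF-8 encoded strings.
inline std::vector<std::string> utf8_command_line_args(int argc, char** argv)
{
#ifdef _WIN32
    (void) argc; (void) argv;       // Windows ANSI encoded, hence ignored here.
    int n_args = 0;
    wchar_t** const wide_args = CommandLineToArgvW(GetCommandLineW(), &n_args);
    if (!wide_args) { throw std::runtime_error("CommandLineToArgvW failed"); }
    std::vector<std::string> result;
    for (int i = 0; i < n_args; ++i) {
        result.push_back(utf8_from_wide(wide_args[i]));
    }
    LocalFree(wide_args);
    return result;
#else
    // Assumption, as in both libraries: the arguments are already UTF-8.
    return std::vector<std::string>(argv, argv + argc);
#endif
}
```

<p>Because the Windows branch ignores its parameters, the same idea also works where the <code>main</code> arguments are not available at all, which is what makes default construction of an arguments object possible there.</p>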

<p>The intended default usage in <strong>stdlib</strong> (and what I hope for in some future C++ standard library support for this) is that a <code>Command_line_args</code> object should be default-constructed wherever command line arguments are needed, which supports use in e.g. the constructor of a namespace scope variable, or in some other function without access to the actual <code>main</code> arguments.</p>

<p>As of July 2017, default construction of <code>Command_line_args</code> is implemented only for Windows and Linux, but code that only needs to be portable to these two systems can look like Listing 3.</p>

<table class="sidebartable">
	<tr>
		<td><img src="/content/images/journals/ol140/Steinbach/Steinbach_07.jpg" /></td>
	</tr>
	<tr>
		<td class="title">Listing 3</td>
	</tr>
</table>

<p>This can be made fully portable by replacing the <code>main</code> code with Listing 4... which, however, is not possible for the mentioned case of constructor for a namespace scope variable (without employing a time machine to check what the future call of <code>main</code> will have).</p>

<table class="sidebartable">
	<tr>
		<td><img src="/content/images/journals/ol140/Steinbach/Steinbach_08.jpg" /></td>
	</tr>
	<tr>
		<td class="title">Listing 4</td>
	</tr>
</table>

<p>The <strong>nowide</strong> library offers only this latter restricted approach of passing the actual <code>main</code> arguments to a fixer object (see Listing 5).</p>

<table class="sidebartable">
	<tr>
		<td><img src="/content/images/journals/ol140/Steinbach/Steinbach_09.jpg" /></td>
	</tr>
	<tr>
		<td class="title">Listing 5</td>
	</tr>
</table>

<p>Using the *nix world convention of representing the command line arguments as an <code>int</code> + <code>char**</code> pair makes it easy to use library functions based on that convention, such as <code>getopt</code>. With <strong>stdlib</strong> the <code>Command_argv_array</code> class offers this value pair. A key difference is that an instance of <strong>stdlib</strong>’s <code>Command_argv_array</code> is a copy of the argument string data, so that the data can be freely modified.</p>
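<p>The idea of an owning <code>int</code> + <code>char**</code> pair can be sketched like this. The class below is a hypothetical stand-in written for this illustration, not <strong>stdlib</strong>’s <code>Command_argv_array</code>, whose actual interface may differ.</p>

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical class: owns copies of the argument strings and exposes them
// in the classic argc/argv form, so that callees can modify the data freely.
class Argv_copy
{
    std::vector<std::string>    strings_;
    std::vector<char*>          pointers_;

public:
    explicit Argv_copy(std::vector<std::string> args)
        : strings_(std::move(args))
    {
        for (std::string& s : strings_) { pointers_.push_back(&s[0]); }
        pointers_.push_back(nullptr);       // By convention argv[argc] is null.
    }

    // The pointers refer into this object's own strings, so copying is disabled.
    Argv_copy(const Argv_copy&) = delete;
    Argv_copy& operator=(const Argv_copy&) = delete;

    int argc() const    { return static_cast<int>(strings_.size()); }
    char** argv()       { return pointers_.data(); }
};
```

<p>A function such as <code>getopt</code>, which may permute or modify the array it is given, can then be handed <code>args.argc()</code> and <code>args.argv()</code> without affecting the original argument data.</p>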

<p>Note: with MinGW g++ and <strong>nowide</strong> the value of <code>n</code> above <em>can be reduced</em> by the declaration of the <code>nowide::args</code> variable, because MinGW g++ provides wildcard expansion of arguments, and the synthesized UTF-8 encoded arguments are not expanded.</p>

<p>Neither <strong>stdlib</strong> nor <strong>nowide</strong> provides dedicated wildcard expansion functionality, but <strong>stdlib</strong> offers portable access to the C++17 filesystem library, which combined with some regular expression matching can do the chore. However, that’s quite complex machinery. E.g. with normal Windows filename wildcards a <code>*</code> doesn’t match backward slashes (which a simple regular expression <code>.*</code> pattern does), and one has to deal with absolute and relative paths. I think wildcard expansion functionality properly belongs with the iteration ability of the filesystem library, and not with mainly a console i/o fix library. Alas, the filesystem library does not yet offer this functionality.</p>
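<p>The separator subtlety can be illustrated with a small hand-written matcher in which <code>*</code> and <code>?</code> never match a path separator. This is only a sketch of the matching rule: the function name is an assumption, matching is case sensitive, and the various quirks of real Windows wildcard matching are ignored.</p>

```cpp
// True if the character is a path separator in Windows.
inline bool is_separator(char ch)
{
    return ch == '\\' || ch == '/';
}

// Simplified wildcard match of a whole name against a pattern:
// '*' matches any run of non-separator characters, possibly empty, and
// '?' matches exactly one non-separator character.
bool wildcard_match(const char* pattern, const char* name)
{
    if (*pattern == '\0') { return *name == '\0'; }
    if (*pattern == '*') {
        // Try each possible length for the run, but never cross a separator.
        for (const char* p = name; ; ++p) {
            if (wildcard_match(pattern + 1, p)) { return true; }
            if (*p == '\0' || is_separator(*p)) { return false; }
        }
    }
    if (*name == '\0') { return false; }
    if (*pattern == '?') {
        return !is_separator(*name) && wildcard_match(pattern + 1, name + 1);
    }
    return *pattern == *name && wildcard_match(pattern + 1, name + 1);
}
```

<p>With this rule <code>*.txt</code> matches <code>a.txt</code> but not <code>dir\a.txt</code>, whereas the regular expression <code>.*\.txt</code> would match both.</p>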

<h2>Using the C++17 filesystem library</h2>

<p>Sometimes an executable has associated files such as configuration files and resource files, placed in the directory that itself resides in, or in some sub-directory there. Thus, sometimes one needs a path to the executable’s directory. The ‘current directory’, the default origin for relative paths, can be and often is some other directory. Usually the current directory is initially the directory from which the program was launched this time, i.e. some arbitrary directory, anywhere. Since the current directory is used automatically, client code does not usually need its path for e.g. resolving command line filename arguments. But client code does, in general, need the path to the executable’s directory.</p>

<p>However, the C++17 filesystem library</p>

<ul>
	<li>provides the generally not needed current directory path, <em>fs</em><code>::current_path()</code> – where <em>fs</em> denotes <code>std::filesystem</code> – and</li>

	<li>does not provide the often crucial executable’s directory path.</li>
</ul>

<p>Happily, the first process command line argument, the first argument of <code>main</code>, is in practice a relative or absolute path to the executable. This is not formally guaranteed, but in practice it’s nearly always so. Ideally then, to determine a path to the executable’s directory, code like this should be sufficient (see Listing 6).</p>

<table class="sidebartable">
	<tr>
		<td><img src="/content/images/journals/ol140/Steinbach/Steinbach_10.jpg" /></td>
	</tr>
	<tr>
		<td class="title">Listing 6</td>
	</tr>
</table>

<p>But run the program from a directory where the relative path to the executable’s directory contains non-ASCII characters<a href="#FN03"><sup>3</sup></a>, and then this simple, natural and (assuming the first argument of <code>main</code> actually refers to the executable) formally correct code fails (Figure 5).</p>

<table class="sidebartable">
	<tr>
		<td><img src="/content/images/journals/ol140/Steinbach/Steinbach_11.jpg" /></td>
	</tr>
	<tr>
		<td class="title">Figure 5</td>
	</tr>
</table>

<p>What’s going on here?</p>

<p>Running from the executable’s directory would work because with this code the name of the executable, passed to <em>fs</em><code>::absolute()</code>, is then effectively a dummy – any filename-like string would do.</p>

<p>But running it from the parent directory involves a non-ASCII character, π, in the path, which is served correctly, as UTF-8, to <em>fs</em><code>::absolute()</code>. Here things go haywire because, as of July 2017, the Visual C++ and MinGW g++ implementations of the C++17 filesystem library <em>ignore</em> the execution character set and instead assume that narrow strings are and should be Windows ANSI encoded… Since Windows ANSI is a country-specific encoding choice the result <code>Ï€</code> can even be different on other machines.</p>

<p>It’s trivially easy to check if the execution character set is UTF-8, and these implementations lay down the rules from scratch, with no frozen history constraining them. So, as I see it, the behaviour is really not excusable. Unfortunately, as far as I know there’s no way that <strong>stdlib</strong> can fix this functionality transparently.</p>

<p>Until all common implementations of the C++17 filesystem library conform to the standard one therefore has to be very careful about always explicitly specifying UTF-8 in code using the filesystem library, by e.g. using the <em>fs</em><code>::u8path</code> factory function (see Listing 7).</p>

<table class="sidebartable">
	<tr>
		<td><img src="/content/images/journals/ol140/Steinbach/Steinbach_12.jpg" /></td>
	</tr>
	<tr>
		<td class="title">Listing 7</td>
	</tr>
</table>

<p>… and the other way by using e.g. the <em>fs</em><code>::path::u8string</code> conversion function:</p>

<pre class="programlisting">
  string const dfp_utf8 = df_path.<strong>u8</strong>string();</pre>

<p>In the first example <code>&quot;data&quot;</code> contains only ASCII characters and can therefore be served raw to the filesystem machinery, but <code>&quot;blueberry-</code>π<code>.txt&quot;</code> is decidedly non-ASCII so that it must be manually tagged as Unicode via a call to <em>fs</em><code>::u8path</code>.</p>
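<p>Applied to the earlier problem of finding the executable’s directory, the same explicitness in both directions might look like the sketch below. It requires a C++17 standard library with <code>&lt;filesystem&gt;</code>; the function name is an assumption, and the first argument of <code>main</code> is assumed to be a UTF-8 encoded path to the executable.</p>

```cpp
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// UTF-8 in, UTF-8 out. Hypothetical helper: the argument is assumed to be a
// UTF-8 encoded relative or absolute path to the executable.
std::string executable_directory_utf8(const std::string& argv0_utf8)
{
    // Tag the narrow string as UTF-8 explicitly instead of relying on the
    // implementation's default interpretation of char-based paths.
    fs::path const exe = fs::absolute(fs::u8path(argv0_utf8));

    // And convert back explicitly as well. Copying via iterators also works
    // where u8string() yields a std::u8string rather than a std::string.
    auto const dir = exe.parent_path().u8string();
    return std::string(dir.begin(), dir.end());
}
```

<p>Nothing in this function depends on the Windows ANSI codepage, which is the point of routing every narrow string through <em>fs</em><code>::u8path</code> and <code>u8string</code>.</p>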

<p>As with the <strong>nowide</strong> library’s incorrect console input operation in Windows, the continued existence of this fundamental level failure of the filesystem library implementations, so very far into the game, appears perplexing, bewildering, inexplicable. But hopefully both the Visual C++ and the MinGW g++ implementations will be fixed. And, as Jerry Pournelle used to put it, Real Soon Now™.</p>

<p>The workarounds, the extra care and explicitness, are all that’s needed with Visual C++. However, with MinGW g++ 7.1 and earlier the workarounds run into another filesystem implementation bug: the MinGW g++ 7.1 implementation of <em>fs</em><code>::u8path</code> can only handle UTF-16 encoded wide strings…</p>

<p>Happily, <strong>stdlib</strong> provides a transparent fix for that.</p>

<p>But that fix must be explicitly requested, by defining <code>STDLIB_FIX_GCC_U8PATH</code>, because it consists of function template specializations that at least in theory won’t necessarily build for a later or earlier version of the compiler, though this code may still work and may be necessary also for such versions. (See Figure 6.)</p>

<table class="sidebartable">
	<tr>
		<td><img src="/content/images/journals/ol140/Steinbach/Steinbach_13.jpg" /></td>
	</tr>
	<tr>
		<td class="title">Figure 6</td>
	</tr>
</table>

<p>In passing: internally this fix uses <code>stdlib::wide_from_utf8</code> and <code>stdlib::utf8_from</code>, which are among the library implementation features that are made available via <strong>stdlib</strong>’s public interface.<a href="#FN04"><sup>4</sup></a></p>

<p>The fix is not needed in the *nix world. In the *nix world <em>fs</em><code>::u8path</code> converts the argument to <code>std::string</code> with no encoding change. And so, for example, in Ubuntu, using g++ 6.3.0, the code compiles and works fine without the fix.</p>

<p>Just as MinGW g++ 7.1’s <em>fs</em><code>::u8path</code> punts on implementing a UTF-8 → UTF-16 conversion in Windows, with MinGW g++ 7.1 an <em>fs</em><code>::path</code> argument to a file iostream constructor is not supported, though it’s required by C++17. The lack of an <em>fs</em><code>::path</code> argument is problematic because g++’s default standard library implementation doesn’t support a wide string argument<a href="#FN05"><sup>5</sup></a>, either, and a narrow string path argument is assumed to be Windows ANSI encoded. And yes, that’s even with UTF-8 execution character set.</p>

<p>There are three main solutions where portable Unicode paths are required:</p>

<ul>
	<li>Only C++17-compatible compilers.
		<p>This means not using MinGW g++, or not testing parts of the code with MinGW g++, or waiting until MinGW g++’s filesystem and iostreams library implementations are fixed.</p>
	</li>

	<li>Pure ASCII alternative paths.
		<p>Windows supports, although not completely and not for all Windows ‘technologies’, alternative pure ASCII paths. These are called short paths. The <strong>stdlib</strong> library provides a more robust abstraction, a best effort mostly readable native encoding narrow path, as <code>stdlib::char_path()</code> &amp; friends.</p>
	</li>

	<li>Custom iostream class.
		<p>If one controls the file opening code, then better replace e.g. <code>std::ifstream</code> with a custom iostream class that supports <em>fs</em><code>::path</code> or wide string argument, or best, that directly and portably supports UTF-8 encoded narrow string argument. The <strong>nowide</strong> library provides that as <code>nowide::ifstream</code> &amp; friends. Such a class can also relatively easily be implemented in terms of <code>__gnu_cxx::stdio_filebuf&lt;char&gt;</code>.</p>
	</li>
</ul>
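<p>As a rough illustration of the third option, here is a minimal sketch of an input stream built on <code>__gnu_cxx::stdio_filebuf&lt;char&gt;</code>. It assumes g++’s standard library (the header <code>&lt;ext/stdio_filebuf.h&gt;</code> is a GNU extension), and the names <code>Utf8_ifstream</code> and <code>open_utf8_for_reading</code> are made up for this example; they are not part of <strong>stdlib</strong> or <strong>nowide</strong>. In Windows the sketch converts the UTF-8 path to UTF-16 and opens the file with <code>_wfopen</code>, which some MinGW configurations may hide in strict standard mode.</p>

```cpp
#include <cstdio>
#include <istream>
#include <memory>
#include <string>
#include <ext/stdio_filebuf.h>      // GNU libstdc++ extension, not standard C++.
#ifdef _WIN32
#   include <windows.h>             // MultiByteToWideChar
#endif

// Hypothetical helper: opens a C FILE* for binary reading, given a UTF-8 encoded path.
inline std::FILE* open_utf8_for_reading(const std::string& utf8_path)
{
#ifdef _WIN32
    // Convert UTF-8 -> UTF-16; passing -1 makes the count include the terminating zero.
    const int n = MultiByteToWideChar(
        CP_UTF8, MB_ERR_INVALID_CHARS, utf8_path.c_str(), -1, nullptr, 0);
    if (n <= 0) { return nullptr; }
    std::wstring wide(static_cast<std::size_t>(n), L'\0');
    MultiByteToWideChar(
        CP_UTF8, MB_ERR_INVALID_CHARS, utf8_path.c_str(), -1, &wide[0], n);
    return _wfopen(wide.c_str(), L"rb");
#else
    // In the *nix world the narrow path is passed on as-is.
    return std::fopen(utf8_path.c_str(), "rb");
#endif
}

// Hypothetical class name: an input stream that accepts a UTF-8 encoded path.
class Utf8_ifstream : public std::istream
{
    struct File_closer
    {
        void operator()(std::FILE* f) const { if (f) { std::fclose(f); } }
    };

    // Declaration order matters: the buffer is destroyed before the FILE* is closed.
    std::unique_ptr<std::FILE, File_closer>             file_;
    std::unique_ptr<__gnu_cxx::stdio_filebuf<char>>     buffer_;

public:
    explicit Utf8_ifstream(const std::string& utf8_path)
        : std::istream(nullptr)
        , file_(open_utf8_for_reading(utf8_path))
    {
        if (file_) {
            buffer_.reset(
                new __gnu_cxx::stdio_filebuf<char>(file_.get(), std::ios_base::in));
            rdbuf(buffer_.get());               // Also resets the stream state to good.
        } else {
            setstate(std::ios_base::failbit);   // Mirrors std::ifstream on open failure.
        }
    }
};
```

<p>Usage is then the same as with <code>std::ifstream</code>, e.g. <code>Utf8_ifstream f{ dfp_utf8 };</code> followed by the usual <code>if( !f ) …</code> check. Output and read/write variants would follow the same pattern with a different open mode.</p>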

<p>Alternative ASCII paths were the basis of the MinGW g++ fix employed in the early Boost Filesystem, version 2 [<a href="#[Boost-c]">Boost-c</a>], but it was discontinued with no alternative fix in version 3, apparently deferring that fix to standardization. The original filesystem TS suggested that iostream constructors in Windows implementations should support the Visual C++ extension of wide character path argument. With C++17 we additionally have iostream constructors accepting <em>fs</em><code>::path</code> directly, except that – the problem – as of this writing, MinGW g++’s default standard library implements neither.</p>

<p>Figure 7 is an example of a pure ASCII alternative path in Windows.</p>

<table class="sidebartable">
	<tr>
		<td><img src="/content/images/journals/ol140/Steinbach/Steinbach_14.jpg" /></td>
	</tr>
	<tr>
		<td class="title">Figure 7</td>
	</tr>
</table>

<p>For readability and to preserve as much information as possible, especially for a name of a file to be created, <code>stdlib::char_path()</code> provides a Windows ANSI path, not a pure ASCII path, where it retains (transcoded) those items of the original Unicode path specification that can be encoded exactly as Windows ANSI (Figure 8).</p>

<table class="sidebartable">
	<tr>
		<td><img src="/content/images/journals/ol140/Steinbach/Steinbach_15.jpg" /></td>
	</tr>
	<tr>
		<td class="title">Figure 8</td>
	</tr>
</table>

<p>Where an item can’t be represented exactly as Windows ANSI and doesn’t have an alternative ASCII name, <code>char_path</code> replaces any non-ANSI character with <code>stdlib::ascii::bad_char</code>, ASCII 127. I assume that this is often the desired behaviour: deferring path validity checking to the file opening code, and just using the path with replacements if it works, e.g. for display, or for creating a file. In contrast, <code>stdlib::char_path_or_x</code> throws a <code>std::runtime_error</code> exception if the Unicode path can’t be represented exactly.</p>

<p>The design intention is to use <code>char_path</code> by default, e.g. for portably passing narrow paths to 3rd party library code, and as a not quite 100% but mostly Just Good Enough™ workaround/fix for filesystem-challenged implementations, like Listing 8.</p>

<table class="sidebartable">
	<tr>
		<td><img src="/content/images/journals/ol140/Steinbach/Steinbach_16.jpg" /></td>
	</tr>
	<tr>
		<td class="title">Listing 8</td>
	</tr>
</table>

<p>Here, the UTF-8 path is used in the failure reporting instead of just outputting the <em>fs</em><code>::path</code> directly, because while MinGW g++ 7.1 curiously does support that, it adds simple ASCII quotes and duplicates every backslash, sort of happily sabotaging things.</p>
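<p>The quotes and doubled backslashes are not specific to MinGW: C++17 specifies that streaming an <em>fs</em><code>::path</code> sends the path text through <code>std::quoted</code>. The effect can be reproduced without the filesystem library, as in this small sketch, where the helper name <code>as_streamed_path</code> is made up for the example:</p>

```cpp
#include <iomanip>      // std::quoted
#include <sstream>
#include <string>

// Hypothetical helper: formats path text the way C++17 specifies for
// `stream << some_fs_path`, namely via std::quoted with its default
// delimiter (") and escape character (\).
inline std::string as_streamed_path(const std::string& path_text)
{
    std::ostringstream stream;
    stream << std::quoted(path_text);
    return stream.str();
}
```

<p>With this, the 15-character Windows path <code>C:\dir\file.txt</code> comes out as <code>"C:\\dir\\file.txt"</code>, quotes included, which is why the listing reports the UTF-8 string instead.</p>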

<p>As mentioned, the newly adopted-in-Boost <strong>nowide</strong> library provides streams that can be opened with UTF-8 encoded paths. And for file opening code that one controls, using an alternative file iostream implementation solves the availability problems of Windows ASCII alternative paths. For the code above, with the standalone variant of <strong>nowide</strong>, this solution entails just adding a</p>

<pre class="programlisting">
  #include &lt;nowide/fstream.hpp&gt;</pre>

<p>replacing <code>ifstream f{ dfp_native };</code> with</p>

<pre class="programlisting">
  nowide::ifstream f{ dfp_utf8 };</pre>

<p>and removing the <code>dfp_native</code> lines, and that’s all.</p>

<p>With this approach, one uses each library for what it’s good at.</p>

<table class="sidebartable">
	<tr>
		<td class="title">ASCII Alternative Paths</td>
	</tr>

	<tr>
		<td>
			<table class="journaltable">
				<tr>
					<td><p>In the *nix world, <code>stdlib::char_path()</code> just returns the argument converted to UTF-8 if necessary, and in Windows it uses the following algorithm to return a best effort readable ANSI path:</p>
					<p style="margin-left:1em"><em>let</em> R (the result) be an empty string.</p>
					<p style="margin-left:1em"><em>for</em> each item in the Unicode path:</p>
					<p style="margin-left:2em"><em>if</em> the item is ASCII <em>then</em></p>
					<p style="margin-left:3em">append it to R.</p>
					<p style="margin-left:2em"><em>else if</em> it converts exactly to Windows ANSI <em>then</em></p>
					<p style="margin-left:3em">append the converted item to R.</p>
					<p style="margin-left:2em"><em>else if</em> it has an alternative ASCII name <em>then</em></p>
					<p style="margin-left:3em">append the alternative ASCII name to R.</p>
					<p style="margin-left:2em"><em>else if</em> character substitution is permitted <em>then</em></p>
					<p style="margin-left:3em">convert the item to ANSI, possibly with substitutions.</p>
					<p style="margin-left:3em">append this possibly inexact ANSI text to R.</p>
					<p style="margin-left:2em"><em>else</em></p>
					<p style="margin-left:3em">fail by throwing a <code>std::runtime_error</code>.</p>
					
					<p>The order of checking is crucial to not needlessly discard information.</p>

					<p>If you want to implement this yourself, then do note that the short very Unicody <code>&#960;</code> as a path item is left as is by Windows’ main API function for this, <code>GetShortPathName</code>, presumably because <code>&#960;</code> is so short. It’s quite perplexing. For, while ASCII alternative paths are a very nice feature indeed, who needs a transformation of Unicode paths to still Unicode unreadable ultimate shortness with cryptic digit sequences, tildes and uppercasing thrown in here and there? I can’t think of any need for that. It appears to be just silly.</p>
					
					<p>Happily the <code>FindFirstFile</code> API function does give a pure ASCII alternative for that <code>&#960;</code>, on a Windows installation and filesystem that supports short paths. And it apparently works fine in general, but only on one single path item, namely the last.</p>
					
					<p>Problems include that short filenames in principle can be turned off via a registry setting (though it’s unlikely, considering that they e.g. appear in registry values), that short filenames can be somewhat cryptic (it’s easy to expand them back though), and that the documentation [<a href="#Microsoft-c">Microsoft-c</a>] states that they’re not available with three Windows ‘technologies’, namely SMB 3.0 Transparent Failover (TFO), SMB 3.0 with Scale-out File Shares (SO), and Cluster Shared Volume File System (CsvFS), which I read as network drives (?).</p></td>
				</tr>
			</table>
		</td>
	</tr>
</table>
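<p>The order of checks in the sidebar can be sketched in portable C++ as follows. This is not the <strong>stdlib</strong> implementation: the Windows-specific steps are injected as callbacks, and the names <code>Platform</code> and <code>best_effort_ansi_path</code>, as well as joining the items with a backslash, are assumptions made for the example.</p>

```cpp
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-ins for the Windows-specific steps; each takes a UTF-8 path item.
struct Platform
{
    // Yields the Windows ANSI text and returns true if the item converts exactly.
    std::function<bool(const std::string& item, std::string& ansi)>    to_ansi_exact;
    // Yields the alternative ASCII (short) name and returns true if one exists.
    std::function<bool(const std::string& item, std::string& ascii)>   short_name;
    // Converts to ANSI, substituting e.g. ASCII 127 for unrepresentable characters.
    std::function<std::string(const std::string& item)>                to_ansi_lossy;
};

inline bool is_ascii(const std::string& s)
{
    for (const unsigned char ch : s) {
        if (ch >= 128) { return false; }
    }
    return true;
}

// Builds a best effort readable ANSI path, checking in the order given in the sidebar.
inline std::string best_effort_ansi_path(
    const std::vector<std::string>&     items,
    const Platform&                     platform,
    const bool                          substitution_permitted
    )
{
    std::string result;
    for (const std::string& item : items) {
        if (!result.empty()) { result += '\\'; }    // Assumption: backslash-joined items.
        std::string converted;
        if (is_ascii(item)) {
            result += item;
        } else if (platform.to_ansi_exact(item, converted)) {
            result += converted;
        } else if (platform.short_name(item, converted)) {
            result += converted;
        } else if (substitution_permitted) {
            result += platform.to_ansi_lossy(item);
        } else {
            throw std::runtime_error("Path item has no exact ANSI representation");
        }
    }
    return result;
}
```

<p>Calling it with <code>substitution_permitted</code> set to <code>true</code> corresponds to the behaviour described for <code>char_path</code>, and <code>false</code> to that of <code>char_path_or_x</code>.</p>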

<h2>Invalid-as-UTF-8 bytes, how, what?</h2>

<p>Narrow text bytes that are invalid as UTF-8 can occur due to a number of possible reasons, e.g. just passing raw <code>main</code> arguments to <code>cout</code>, or doing conversion from wide text to the narrow encoding of the user’s native locale, which in Windows cannot be UTF-8.</p>

<p>When this happens, it’s in my opinion best if it doesn’t stop output of further text, or indeed, of the text containing the bad bytes.</p>

<p><strong>stdlib</strong> just replaces each bad byte with ASCII 127, <code>DEL</code> (see Listing 9). The result of the <strong>stdlib</strong>-based code is in Figure 9 – it works the same with g++.</p>

<table class="sidebartable">
	<tr>
		<td><img src="/content/images/journals/ol140/Steinbach/Steinbach_18.jpg" /></td>
	</tr>
	<tr>
		<td class="title">Listing 9</td>
	</tr>
</table>

<table class="sidebartable">
	<tr>
		<td><img src="/content/images/journals/ol140/Steinbach/Steinbach_19.jpg" /></td>
	</tr>
	<tr>
		<td class="title">Figure 9</td>
	</tr>
</table>
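<p>For readers who want to experiment with this replacement policy outside of <strong>stdlib</strong>, here is a minimal sketch of one way to do it. It is not the library’s code, and the function names are made up for the example. A byte that does not start a well-formed UTF-8 sequence (stray continuation byte, overlong form, encoded surrogate, value above U+10FFFF, or truncated sequence) is replaced with ASCII 127, and scanning resumes at the next byte.</p>

```cpp
#include <cstddef>
#include <string>

// Returns the length (1 to 4) of the well-formed UTF-8 sequence that starts
// at s[i], or 0 if the byte at s[i] does not start such a sequence.
inline std::size_t valid_sequence_length(const std::string& s, const std::size_t i)
{
    const unsigned char first = static_cast<unsigned char>(s[i]);
    std::size_t     n       = 0;
    unsigned long   code    = 0;
    if (first < 0x80)                           { return 1; }
    else if (first >= 0xC2 && first <= 0xDF)    { n = 2; code = first & 0x1Fu; }
    else if (first >= 0xE0 && first <= 0xEF)    { n = 3; code = first & 0x0Fu; }
    else if (first >= 0xF0 && first <= 0xF4)    { n = 4; code = first & 0x07u; }
    else { return 0; }      // Stray continuation byte, or a lead byte never used in UTF-8.

    if (i + n > s.size()) { return 0; }         // Truncated sequence.
    for (std::size_t k = 1; k < n; ++k) {
        const unsigned char next = static_cast<unsigned char>(s[i + k]);
        if ((next & 0xC0u) != 0x80u) { return 0; }      // Not a continuation byte.
        code = (code << 6) | (next & 0x3Fu);
    }
    if (n == 3 && code < 0x800ul) { return 0; }                         // Overlong.
    if (n == 4 && (code < 0x10000ul || code > 0x10FFFFul)) { return 0; } // Overlong or too large.
    if (code >= 0xD800ul && code <= 0xDFFFul) { return 0; }             // Encoded surrogate.
    return n;
}

// Copies the text, replacing each byte that is invalid as UTF-8 with `replacement`.
inline std::string with_bad_bytes_replaced(
    const std::string& s, const char replacement = '\x7F'
    )
{
    std::string result;
    result.reserve(s.size());
    for (std::size_t i = 0; i < s.size(); ) {
        const std::size_t n = valid_sequence_length(s, i);
        if (n == 0) {
            result += replacement;
            ++i;
        } else {
            result.append(s, i, n);
            i += n;
        }
    }
    return result;
}
```

<p>For example, Windows ANSI encoded Norwegian text, where each of å, æ and ø is a single byte above 127, comes out with each such byte replaced by <code>DEL</code> and the ASCII letters intact, while valid UTF-8 passes through unchanged.</p>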

<p>The corresponding <strong>nowide</strong>-based code is in Listing 10 and the result of the <strong>nowide</strong>-based code is in Figure 10.</p>

<table class="sidebartable">
	<tr>
		<td><img src="/content/images/journals/ol140/Steinbach/Steinbach_20.jpg" /></td>
	</tr>
	<tr>
		<td class="title">Listing 10</td>
	</tr>
</table>

<table class="sidebartable">
	<tr>
		<td><img src="/content/images/journals/ol140/Steinbach/Steinbach_21.jpg" /></td>
	</tr>
	<tr>
		<td class="title">Figure 10</td>
	</tr>
</table>

<h2>Summary</h2>

<p>There are currently two C++ libraries for UTF-8 console i/o in Windows: the author’s <strong>stdlib</strong>, and the <strong>nowide</strong> library recently adopted in Boost. With <strong>stdlib</strong>, existing textbook code can work for Unicode console i/o in Windows, and since it’s a header-only library it’s easy to use for novices. With <strong>nowide</strong> there is separate compilation, which can be a barrier to novices, and one’s code must be modified to explicitly use the <strong>nowide</strong> functionality, which also means that existing, unmodified code doesn’t benefit from <strong>nowide</strong>.</p>

<p>As of this writing, console input just didn’t work correctly with <strong>nowide</strong> – it included carriage return characters in input lines.</p>

<p>The <strong>nowide</strong> library’s <code>nowide::ifstream</code> (&amp; family) can be very useful as a workaround for MinGW g++’s current filesystem library implementation deficiencies, when one controls the file opening code. The corresponding <strong>stdlib</strong> fix <code>stdlib::char_path</code> is based on Windows’ alternative ASCII names, which is easy to use and supports 3rd party library functions such as with OpenCV. It’s guaranteed to work for a path that can be represented exactly with Windows ANSI encoding, plus this approach has worked for general Unicode existing paths on all the myriad local Windows systems that the author has used. I.e. it’s not a perfect fix, but simple and usually Good Enough™.</p>

<h2>References</h2>

<p class="bibliomixed"><a id="[Boost-a]"></a>[Boost-a] At <a href="http://www.boost.org/doc/libs/1_48_0/libs/locale/doc/html/default_encoding_under_windows.html">http://www.boost.org/doc/libs/1_48_0/libs/locale/doc/html/default_encoding_under_windows.html</a></p>

<p class="bibliomixed"><a id="[Boost-b]"></a>[Boost-b] Boost acceptance of NoWide: <a href="https://lists.boost.org/boost-announce/2017/06/0516.php">https://lists.boost.org/boost-announce/2017/06/0516.php</a></p>

<p class="bibliomixed"><a id="[Boost-c]"></a>[Boost-c] Referred to in a 2011 discussion between the Boost Filesystem creator Beman Dawes and the author, titled ‘Making Boost.Filesystem work with GENERAL filenames with g++ in Windows (a solution)’, at <a href="https://lists.boost.org/Archives/boost/2011/10/187282.php">https://lists.boost.org/Archives/boost/2011/10/187282.php</a></p>

<p class="bibliomixed"><a id="[C99]"></a>[C99] C99 §7.17/2 (I used the N1256 draft, roughly C99 + TC1 + TC2 + TC3, for the quote).</p>

<p class="bibliomixed"><a id="[Ermey]"></a>[Ermey] Quoted from <a href="https://www.brainyquote.com/quotes/quotes/r/rleeermey464853.html">https://www.brainyquote.com/quotes/quotes/r/rleeermey464853.html</a></p>

<p class="bibliomixed"><a id="[ICU]"></a>[ICU] The International Components for Unicode library, available at <a href="http://site.icu-project.org/">http://site.icu-project.org/</a></p>

<p class="bibliomixed"><a id="[Kaplan08]"></a>[Kaplan08] Still available at <a href="http://archives.miloush.net/michkap/archive/2008/03/18/8306597.html">http://archives.miloush.net/michkap/archive/2008/03/18/8306597.html</a></p>

<p class="bibliomixed"><a id="[Microsoft-a]"></a>[Microsoft-a] Quoting Microsoft’s documentation of <code>setlocale</code>: “<em>If you provide a code page value of UTF-7 or UTF-8, </em><code>setlocale</code><em> will fail, returning </em><code>NULL</code><em>.</em>” ATTOW that documentation was available at <a href="https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/setlocale-wsetlocale">https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/setlocale-wsetlocale</a></p>

<p class="bibliomixed"><a id="[Microsoft-b]"></a>[Microsoft-b] <code>_setmode</code> docs at <a href="https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/setmode">https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/setmode</a></p>

<p class="bibliomixed"><a id="[Microsoft-c]"></a>[Microsoft-c] Windows API function GetShortPathName documentation, at <a href="https://msdn.microsoft.com/en-us/library/windows/desktop/aa364989(v=vs.85).aspx">https://msdn.microsoft.com/en-us/library/windows/desktop/aa364989(v=vs.85).aspx</a></p>

<p class="bibliomixed"><a id="[nowide]"></a>[nowide] The NoWide library is available at <a href="http://cppcms.com/files/nowide/html/index.html">http://cppcms.com/files/nowide/html/index.html</a></p>

<p class="bibliomixed"><a id="[stdlib]"></a>[stdlib] The <strong>stdlib</strong> library is available at <a href="https://github.com/alf-p-steinbach/stdlib">https://github.com/alf-p-steinbach/stdlib</a></p>

<p class="bibliomixed"><a id="[Steinbach11]"></a>[Steinbach11] ‘Unicode part 1: Windows console i/o approaches’, at <a href="https://alfps.wordpress.com/2011/11/22/unicode-part-1-windows-console-io-approaches/">https://alfps.wordpress.com/2011/11/22/unicode-part-1-windows-console-io-approaches/</a></p>

<p class="bibliomixed"><a id="[Steinbach13]"></a>[Steinbach13] ‘Portable String Literals in C++’, <em>Overload</em> #116, August 2013, available at <a href="https://accu.org/index.php/articles/1842">https://accu.org/index.php/articles/1842</a></p>

<p class="footnotes"></p>

<ul>
	<li><a id="FN01"></a>The BMP, the <em>Basic Multilingual Plane</em>, is Unicode restricted to 16 bits, like in Unicode version 1 in 1991/1992. The 21-bit version 2 came in 1996. By that time Microsoft had committed to 16-bit Unicode. Unicode 2’s UTF-16 encoding was designed to allow the existing 16-bit Unicode systems (various programming languages, + Windows) to just keep on working; a backward-compatible encoding. So most of Windows uses full UTF-16, but Windows console windows have a non-streaming API that restricts each character position to 16 bits. Hence if you output a UTF-16 surrogate pair (representing a Unicode code point outside the BMP, e.g. an emoji or an archaic Chinese glyph) to a Windows console window, you get two characters displayed, probably as “I didn’t understand that” squares.</li>

	<li><a id="FN02"></a>To change the default font for a Windows console window, just right click the window title for a menu, and drill down into it.</li>

	<li><a id="FN03"></a>Using the name “cat”, expressed as Russian “кошка”, for an executable that lists the contents of a multi-language text file, is a weak pun. It was the best I could do.</li>

	<li><a id="FN04"></a>ATTOW these conversion functions are limited to UTF-16 for wide text, e.g. they can’t (properly) handle emojis in the *nix world. I intend to remove that limitation, but must do one thing at a time.</li>

	<li><a id="FN05"></a>C++17 §30.9.1/3 requires wide string filename argument support for iostreams implementations on systems with wide native paths. Prior to C++17 this was a Visual C++ extension of the standard library.</li>
</ul>
</p>
</div>
</channel>
</rss>
