Title: Use UTF-16 Interfaces to Ship Windows Code

Author: Bob Schmidt

Date: Mon, 03 June 2019 19:01:46 +01:00

Summary: Character encoding can cause problems. Péter Ésik explains why UTF-16 interfaces help on Windows.

Body:

Listing 1 is a small program that takes a file path as a parameter and queries its size. Even though stat is a POSIX function, it happens to be available on Windows as well, so this small program works on both POSIX platforms and Windows. Or does it? Let’s try it out with two test files. For test.txt, it correctly reports the file’s size. For Hello, мир.txt, however, the stat call fails (on my machine), even though the file clearly exists (see Figure 1). Why is that?

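#include <iostream>
#include <sys/stat.h>

int main (int argc, char* argv[])
{
  // Guard against a missing argument before touching argv[1].
  if (argc < 2) {
    std::cout << "Usage: pass a file path\n";
    return 1;
  }
  struct stat fileInfo;
  if (stat (argv[1], &fileInfo) == 0) {
    std::cout << "The file's size is: "
      << fileInfo.st_size << " bytes\n";
  } else {
    // Adjacent string literals are concatenated by the compiler.
    std::cout << "Unable to stat file (maybe it "
      "doesn't exist?)\n";
  }
}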

Listing 1

Figure 1

The ‘ANSI’ vs. UTF-16 story in five minutes (or less)

Back in the day, character encodings were quite rudimentary. The first encoding widely adopted by computer systems was ASCII [Wikipedia-1] (a 7-bit encoding), capable of encoding the English alphabet and some other characters (numbers, mathematical symbols, control characters, etc.). Of course, the obvious need arose to encode more characters as users expected computers to speak their language, to type their native letters in e-mails, etc. 8-bit encodings provided a partial solution: code points 0–127 were the same as ASCII (for compatibility), while extra characters were encoded in the range 128–255.¹ Those extra 128 code points were not enough to encode all letters of all languages at once, so character mappings were applied, most commonly known as code pages.

This means that a character or a string that’s encoded like this has no meaning in itself: you need to know which code page to interpret it with (this is somewhat analogous to files and their extensions). For example, code point 0x8A means ä (lowercase a with an umlaut) if you interpret it using the Macintosh Central European encoding [Wikipedia-2], but encodes Š (uppercase S with caron) if you use the Windows-1252 [Wikipedia-3] (Latin alphabet) code page.
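To make the ambiguity concrete, here is a minimal sketch (not a real decoder) that interprets a single byte under the two code pages just mentioned. The names CodePage and decodeByte are made up for this illustration, and only the one non-ASCII mapping discussed in the text is filled in.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, heavily truncated code page tables: only ASCII and the single
// byte 0x8A from the text are handled; a real table has 128 upper entries.
enum class CodePage { MacCentralEuropean, Windows1252 };

// Returns the Unicode code point for `byte` under `page`, or 0xFFFD (the
// Unicode replacement character) for bytes this sketch does not know.
char32_t decodeByte(std::uint8_t byte, CodePage page)
{
    if (byte < 0x80) {
        return byte;  // both code pages agree with ASCII in the lower half
    }
    if (byte == 0x8A) {
        return page == CodePage::MacCentralEuropean
            ? 0x00E4   // U+00E4, lowercase a with umlaut
            : 0x0160;  // U+0160, uppercase S with caron
    }
    return 0xFFFD;
}
```

The same byte yields two different characters, and nothing in the byte itself says which answer is the intended one.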

This approach has two obvious problems: first, it’s easy to get encodings wrong (for instance, .txt files have no header, so you simply can’t store the code page used), resulting in so-called mojibake [Wikipedia-4]. Second, you can’t mix and match characters with different encodings easily. For example, if you wanted to encode the string "Шнурки means cipőfűző" (with Windows code pages), you would have to encode "Шнурки" with code page 1251 (Windows Cyrillic) [Wikipedia-5], " means " with a code page of your choice (as it contains ASCII characters only), and "cipőfűző" with code page 1250 (Windows Central European) [Wikipedia-6]. To correctly decode and display this string later, you would have to store which code pages were used for which parts, making string handling inefficient and extremely complex.

Because of problems like these, encodings were desired that could represent ‘all’ characters at once. One of these emerging encodings was UCS-2 (by the Unicode working group), which used 16-bit wide code units and code points. Windows adopted UCS-2 quite early: Windows NT 3.1 (the very first OS of the NT series, released in 1993) and its file system, NTFS, used it internally. Even though the 32-bit Windows API debuted with NT 3.1 as brand new, support for 8-bit encodings was still necessary.² As UCS-2 used 16-bit code units, and the C language does not support function overloading, Microsoft introduced two versions of every API function that had to work with strings (either directly or indirectly): a UCS-2 version, with a W suffix (‘wide’, working with wchar_t strings), and one for 8-bit code paged strings, with an A suffix (‘ANSI’³, working with char strings).

The A functions act as mere wrappers: usually⁴ they just convert the string parameters and forward the call to the corresponding W version. So, for example, there is no such function as MessageBox; there are only MessageBoxA and MessageBoxW. Depending on the strings you have, you need to call the appropriate version of the two.⁵
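The convert-and-forward pattern can be sketched portably as follows. This is not Windows code: showTextW and showTextA are hypothetical stand-ins for a W/A function pair, and the conversion assumes ISO-8859-1 (where each byte value equals its code point) in place of the active code page that the real A functions use.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for a native UTF-16 function such as MessageBoxW.
// To stay testable it only reports how many UTF-16 code units it received.
std::size_t showTextW(const std::u16string& text)
{
    return text.size();
}

// Stand-in for the conversion step. ISO-8859-1 is assumed here because its
// byte values map one to one onto code points; a real A function would
// convert using the active code page instead.
std::u16string widenLatin1(const std::string& narrow)
{
    std::u16string wide;
    wide.reserve(narrow.size());
    for (unsigned char byte : narrow) {
        wide.push_back(static_cast<char16_t>(byte));
    }
    return wide;
}

// The 'A' flavour holds no logic of its own: convert, then forward.
std::size_t showTextA(const std::string& text)
{
    return showTextW(widenLatin1(text));
}
```

All the real work lives in the wide function, so the narrow one can only be as good as the conversion in front of it.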

Which code page is used to interpret strings in the A family of functions? Is there a code page parameter passed? No, they use a system-wide setting called the active code page, located in the registry at HKLM\SYSTEM\CurrentControlSet\Control\Nls\CodePage\ACP. This value is decided based on the region you choose at installation time, but can also be changed later in the Control Panel.
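If you are curious which value your own machine uses, the Win32 function GetACP returns it. The wrapper below is a small sketch; the name activeCodePage and the 0 returned on non-Windows platforms are assumptions of this example, not part of any API.

```cpp
#include <cassert>
#ifdef _WIN32
#include <windows.h>
#endif

// Returns the active ('ANSI') code page identifier on Windows, for example
// 1250 or 1252. Other platforms have no such setting, so this sketch returns
// 0 there as a "not applicable" marker (an assumption of the example).
unsigned activeCodePage()
{
#ifdef _WIN32
    return GetACP();  // Win32 API, declared via <windows.h>
#else
    return 0;
#endif
}
```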

Eventually, UCS-2 evolved into UTF-16, and starting with Windows 2000, the OS had support for it. Since UCS-2 is fully compatible with UTF-16, programs didn’t need to be rewritten or even recompiled.
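One way to see the compatibility is to look at how UTF-16 encodes a single code point. The sketch below (encodeUtf16 is a name chosen for this example) produces exactly one code unit for everything UCS-2 could represent, and a pair of so-called surrogate code units only for code points beyond that range.

```cpp
#include <cassert>
#include <string>

// Encodes one Unicode code point as UTF-16. Assumes `codePoint` is a valid
// scalar value (at most 0x10FFFF and not itself a surrogate).
std::u16string encodeUtf16(char32_t codePoint)
{
    std::u16string units;
    if (codePoint < 0x10000) {
        // Basic Multilingual Plane: one code unit, identical to UCS-2.
        units.push_back(static_cast<char16_t>(codePoint));
    } else {
        // Beyond the reach of UCS-2: split into a high and a low surrogate.
        const char32_t offset = codePoint - 0x10000;
        units.push_back(static_cast<char16_t>(0xD800 + (offset >> 10)));
        units.push_back(static_cast<char16_t>(0xDC00 + (offset & 0x3FF)));
    }
    return units;
}
```

For example, U+043C (Cyrillic м) stays the single unit 0x043C, while U+1F600 becomes the pair 0xD83D 0xDE00.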

Back to the test program

Armed with this knowledge, it’s easy to see why the small test program doesn’t work for certain files. This is what happens:

The program is started (with whatever parameters).
Very early in the startup phase, Windows converts the (native) UTF-16 command line to an ‘ANSI’ string using the active code page, and stores it in a global variable.
Because regular main was used (with ‘narrow’, char parameters) in this application, early in the startup phase the CRT queries the command line with GetCommandLineA (this just returns the global that was set up by the previous step), converts it into an array, and passes it down to main.

The problem is that there might be characters in the UTF-16 command line that have no representation in the currently active code page. For example, my computer’s locale is set to Hungarian, therefore my ACP is 1250 (Windows Central European) [Wikipedia-6]. Cyrillic characters such as м, и, and р have no representation in this encoding, so when the UTF-16 to ‘ANSI’ conversion is performed, these characters are replaced with question marks (see Figure 2).⁶ When stat is called with the string "D:\temp\temp\Hello, ???.txt" (which by the way involves an ‘ANSI’ to UTF-16 conversion internally), of course it fails, because there is no file named Hello, ???.txt in that directory.

Figure 2
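The replacement step can be imitated portably. The sketch below is a simulation, not the conversion Windows actually performs (which goes through the code page tables, for instance via WideCharToMultiByte): toNarrowLossy is a hypothetical helper that treats only ASCII as representable, which is enough to reproduce the file name from the example.

```cpp
#include <cassert>
#include <string>

// Simulates a lossy UTF-16 to 'ANSI' conversion. Only the ASCII range is
// treated as representable here; a real code page such as 1250 also maps
// many accented Latin letters, but still has no Cyrillic.
std::string toNarrowLossy(const std::u16string& wide)
{
    std::string narrow;
    narrow.reserve(wide.size());
    for (char16_t unit : wide) {
        // Anything without a narrow representation degrades to '?'.
        narrow.push_back(unit < 0x80 ? static_cast<char>(unit) : '?');
    }
    return narrow;
}
```

Feeding it u"Hello, \u043C\u0438\u0440.txt" (the Cyrillic letters written as escapes) gives "Hello, ???.txt". The original name cannot be recovered from that result, which is why the later conversion back to UTF-16 inside stat does not help.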

cURL

It’s not that hard to bump into applications or libraries suffering from these problems. cURL, for example, is one of them. Now don’t get me wrong, I’m in no way saying that it’s a badly written piece of software, quite the contrary. It’s a battle-tested, popular open source project with a long history and a plethora of users. Actually, I think this is what makes it a perfect example: even if your code is spot on, this aspect of shipping to Windows is very easy to overlook.

For file IO, cURL uses standard C functions (such as fopen). This means that, for example, if you want your request’s result written into an output file, it will fail if the file’s path contains characters not representable with the system’s current ‘ANSI’ code page.

Another example is IDN (internationalized domain name) handling. cURL does support IDNs, but let’s see what happens if I try it out using the standalone command line version (see Figure 3).

Figure 3

Even though magyarország.icom.museum exists, and its string form is perfectly representable with my machine’s ACP (1250), cURL fails with an error. Looking at the source code quickly reveals the culprit:

cURL needs to convert the IDN to so-called Punycode [Wikipedia-7] before issuing the request.
It does so with IdnToAscii [WindowsDev], but this function expects a UTF-16 input string.
Even though the original string (which originates from the command line parameter) is an ‘ANSI’ string, conversion to UTF-16 is attempted assuming it’s UTF-8. This makes the conversion fail, and thus cURL aborts with an error message.
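Why would such a conversion fail? A code page 1250 string containing á is simply not well-formed UTF-8. The sketch below is not cURL’s code; looksLikeUtf8 is a hypothetical, simplified check that only verifies the structure of lead and continuation bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Simplified UTF-8 structure check: verifies lead bytes and continuation
// bytes only. It does not reject overlong forms or encoded surrogates, which
// a production-grade validator must do.
bool looksLikeUtf8(const std::string& bytes)
{
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t continuationCount;
        if (lead < 0x80) {
            continuationCount = 0;               // 0xxxxxxx: ASCII
        } else if ((lead & 0xE0) == 0xC0) {
            continuationCount = 1;               // 110xxxxx
        } else if ((lead & 0xF0) == 0xE0) {
            continuationCount = 2;               // 1110xxxx
        } else if ((lead & 0xF8) == 0xF0) {
            continuationCount = 3;               // 11110xxx
        } else {
            return false;                        // not a valid lead byte
        }
        if (continuationCount > bytes.size() - i - 1) {
            return false;                        // sequence is cut short
        }
        for (std::size_t k = 1; k <= continuationCount; ++k) {
            const unsigned char next =
                static_cast<unsigned char>(bytes[i + k]);
            if ((next & 0xC0) != 0x80) {
                return false;                    // expected 10xxxxxx
            }
        }
        i += continuationCount + 1;
    }
    return true;
}
```

In code page 1250, á is the single byte 0xE1, which UTF-8 reads as the start of a three-byte sequence; the ASCII letter that follows is not a continuation byte, so "magyarorsz\xE1g" is rejected. The UTF-8 form of the same word, "magyarorsz\xC3\xA1g", passes.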

cURL developers are aware of this category of problems: it’s listed on their known bugs page [curl].

Solution

The solution is in the title of this article: use UTF-16 (native) interfaces on Windows. That is:

Instead of regular main, use wmain as an entry point, which has wchar_t string arguments.
Always use the wide version of runtime functions (_wfopen_s over plain fopen, wcslen instead of strlen, etc.).
If you need to call Win32 functions directly, never use the ‘ANSI’ version with the A suffix; use their UTF-16 counterparts (ending with W).

While this sounds great on paper, there is a catch. You can only use these functions on Windows, as:

Some of the widechar runtime functions are Windows-only (such as _wfopen).
The size and semantics of wchar_t are implementation defined. While on Windows it’s a 2-byte type representing a UTF-16 code unit, on POSIX systems it’s usually 4 bytes in size, embodying a UTF-32 code unit.
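The difference in wchar_t is easy to check on any given platform. The following is a small sketch (wcharBits is a name chosen for this example); the static_assert records the Windows expectation described above, and the function reports whatever the current compiler uses.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>

// Width of wchar_t in bits on the platform this is compiled for.
constexpr std::size_t wcharBits()
{
    return sizeof(wchar_t) * CHAR_BIT;
}

#ifdef _WIN32
// On Windows, wchar_t holds one UTF-16 code unit.
static_assert(wcharBits() == 16, "expected a 16-bit wchar_t on Windows");
#endif
```

Code that stores wchar_t strings in files or sends them over the network would therefore see different layouts on different platforms.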

One possible solution is to utilize typedefs and macros. See Listing 2.

#include <cstdio>   // fopen, _wfopen
#include <string>   // std::string, std::wstring

#ifdef _WIN32

using nchar = wchar_t;
using nstring = std::wstring;

#define NSTRLITERAL(str) L##str
#define nfopen _wfopen
/* ... */

#else

using nchar = char;
using nstring = std::string;

#define NSTRLITERAL(str) str
#define nfopen fopen
/* ... */

#endif  // #ifdef _WIN32

Listing 2

This simple technique can go a long way (it can be done somewhat more elegantly, but you get the idea), unless you need to exchange strings between different platforms (over the network, serialization, etc.).
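As an illustration of how this plays out, here is one possible way to rewrite the size query from Listing 1 on top of such aliases. It is a sketch under assumptions: nativeFileSize is a hypothetical helper, only the two aliases it needs are repeated so the snippet stands alone, and on Windows it uses _wstat, the wide-character counterpart of stat in the Microsoft runtime.

```cpp
#include <cassert>
#include <string>
#include <sys/stat.h>

#ifdef _WIN32
using nstring = std::wstring;
#define NSTRLITERAL(str) L##str
#else
using nstring = std::string;
#define NSTRLITERAL(str) str
#endif

// Hypothetical helper: stores the size of the file at `path` in `size` and
// returns true on success. The path stays in the platform's native encoding
// all the way down, so no lossy conversion takes place.
bool nativeFileSize(const nstring& path, long long& size)
{
#ifdef _WIN32
    struct _stat fileInfo;
    if (_wstat(path.c_str(), &fileInfo) != 0) {
        return false;
    }
#else
    struct stat fileInfo;
    if (stat(path.c_str(), &fileInfo) != 0) {
        return false;
    }
#endif
    size = static_cast<long long>(fileInfo.st_size);
    return true;
}
```

On Windows, the helper would be called from wmain with the wchar_t argument it receives; on POSIX platforms, from regular main with the char argument. In both cases the caller passes the argument through unchanged.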

Closing thoughts

I know some people think that the problem presented in this article is marginal, and using char strings and ‘ANSI’ interfaces on Windows is ‘good enough’. Keep in mind though that in commercial environments the following situation is not that rare:

Company X outsources work to company Y, but they reside in different parts of the world.
Therefore, the computers of company Y have a different ACP from those of company X.
It’s very likely that the outsourced work involves using strings in company X’s locale, which will be problematic on Windows if the software used for said work misbehaves in this situation.

Don’t be surprised if a potential client of yours turns down a license purchase because of problems like this.

References

[curl] ‘Known Bugs’, curl: https://curl.haxx.se/docs/knownbugs.html#can_t_handle_Unicode_arguments_i

[Microsoft18] ‘Working with strings’, published on 31 May 2018 at https://docs.microsoft.com/en-gb/windows/desktop/LearnWin32/working-with-strings

[Wikipedia-1] ASCII: https://en.wikipedia.org/wiki/ASCII

[Wikipedia-2] Macintosh Central European Encoding: https://en.wikipedia.org/wiki/Macintosh_Central_European_encoding

[Wikipedia-3] Windows-1252 code page: https://en.wikipedia.org/wiki/Windows-1252

[Wikipedia-4] Mojibake: https://en.wikipedia.org/wiki/Mojibake

[Wikipedia-5] Windows-1251: https://en.wikipedia.org/wiki/Windows-1251

[Wikipedia-6] Windows-1250: https://en.wikipedia.org/wiki/Windows-1250

[Wikipedia-7] Punycode: https://en.wikipedia.org/wiki/Punycode

[WindowsDev] IdnToAscii function, Windows Dev Center: https://docs.microsoft.com/en-gb/windows/desktop/api/winnls/nf-winnls-idntoascii

¹ There are some languages with many more symbols than 128 or 255 (Japanese, Chinese, etc.), which led to the invention of DBCS/MBCS character sets. I’m not mentioning them here for simplicity.
² One major reason for this was (other than UCS-2 not being widespread at the time) that the consumer line of Windows OSes (95, 98, etc.) had very limited support for UCS-2, but applications targeting Win32 had to run on both lines of Windows.
³ Technically, it’s not correct to call these functions ‘ANSI’ versions, as none of the supported code pages are ANSI standards. This term has historical roots, as the first Windows code page (1252) was based on an ANSI draft. On recent versions of Windows 10, the ACP can be set to UTF-8. Therefore, it’s best to think about ‘ANSI code pages’ as ‘some encoding that’s not UTF-16’.
⁴ One exception I know of is OutputDebugString, where the ‘ANSI’ version is the native one (OutputDebugStringW will convert to ‘ANSI’ and call OutputDebugStringA).
⁵ It’s possible to create programs that can be compiled to support either the W or A interfaces without source changes, using predefined macros [Microsoft18]. Nowadays, however, that’s largely irrelevant. If you are writing programs that target modern Windows versions (only NT), there is almost no reason to use A interfaces.
⁶ The exact mappings are defined in .nls files located in the System32 directory.

Péter Ésik Péter has been working as a C++ software developer for 5 years. He has a knack for everything low level, including (but not limited to) OS internals, assembly, and post-mortem crash analysis. His blog can be found at http://peteronprogramming.wordpress.com.
