Title: Close Encounters with Internationalisation

Author:

Date: Tue, 03 November 1998 13:15:28 +00:00

Summary:

Body:

I have recently been involved in a project developing a user interface for some equipment aimed at the Japanese market. This is the first time I've been involved in any project which supports more than one language, and it was made all the more interesting in that the language is not based on the Roman alphabet, which those of us who live in the West know so well. I had to do some background work in order to get an understanding of the complexities of character sets, and after trawling through much documentation, both paper and web based, I decided that it would be profitable to put down what I'd learned on paper. My main objective in doing so was to consolidate what I'd learned in a form that would be useful for future reference. However, submitting it to C Vu makes my efforts far more worthwhile, as my experience may prove useful to others, and hopefully the questions which I throw up will generate some response.

Japanese Character Sets

Before I launch into the ins and outs of character sets, let me set the scene with some background information on the Japanese writing system. The Japanese have, for hundreds of years, been borrowing characters from the Chinese to supplement their alphabet. As a result their main alphabet - Kanji, which literally translates as "Chinese character" - has thousands of characters. As learning enough to read any major work of literature is a daunting task, the Japanese supplement Kanji with two phonetic alphabets - Hiragana and Katakana. Hiragana is used to spell out Japanese words phonetically, including less common written words, thereby doing away with the requirement of a formidable education and good memory. Katakana is used to spell out foreign words. The actual sounds are pretty much the same as those represented by Hiragana; it's just that different alphabets are used for native and foreign words.

Character Representation & Encoding

The task of representing Kanji within computers, together with Kanji I/O operations, is one which has taken up some effort over the last few decades, and one of which many software engineers in the West have been largely ignorant. As with most software engineers in the West, I'd taken ASCII for granted and never thought about it too much until I was faced with alphabets that were radically different to the Roman alphabet. I found that I had to clarify my ideas on character set representation and encoding. Firstly (and it seems as though I'm stating the obvious here) the set of characters available is known as the character set. The character set makes no reference to how the characters are actually encoded; it only defines which characters are available. The encoding scheme assigns a numeric representation to each character. A sensible encoding scheme orders the characters in some useful way, as ASCII does. Ultimately the encoding is a trade-off between useful ordering and efficient implementation.

The Japanese have three main character set standards, and they are cryptically referred to as JIS-X-0201, JIS-X-0208 and JIS-X-0212. Strictly speaking I should postfix the numbers with a four digit year specifying which version of the standard I'm referring to, but let's keep things simple for now. JIS stands for Japanese Industrial Standard. JIS-X-0201 is very similar to ASCII in the first 128 characters, the only differences being that the tilde character is replaced by an overbar and the backslash by a Yen sign. The "top half" of the character set contains Katakana characters. All characters in JIS-X-0201 can be represented in an 8 by 8 grid - Katakana is good that way.

JIS-X-0208 defines the main set of Kanji used. This contains most of the Kanji in everyday use. The character set also contains the Hiragana characters and the Katakana characters, together with a generous selection of non-Japanese characters including Roman, Greek and Russian. Although the character set standard does not define any encoding scheme, all characters are assigned reference numbers. These are based on their position in the table which the standard provides, and have the format XXYY, where XX is the row and YY the column. They are known as Kuten codes - translated, Kuten means "row, column".
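To make the XXYY format concrete, here is a minimal sketch in C that splits a Kuten code into its row and column. The names kuten_split() and kuten_to_jis() are made up for illustration. The second function rests on an assumption that is not stated above: that the usual 7-bit JIS code for a character is obtained by adding 0x20 to both the row and the column, which places each byte in the printable range 0x21 to 0x7E.

```c
#include <assert.h>

/* A Kuten code XXYY: XX is the row, YY the column, each in the range 1..94. */
struct kuten {
    int row;
    int cell;
};

/* Split a decimal Kuten number such as 1601 into row 16 and column 1.
   (Hypothetical helper, not part of any standard library.) */
struct kuten kuten_split(int xxyy)
{
    struct kuten k;

    k.row  = xxyy / 100;
    k.cell = xxyy % 100;
    assert(k.row >= 1 && k.row <= 94);
    assert(k.cell >= 1 && k.cell <= 94);
    return k;
}

/* Assumption: the 7-bit JIS code is (row + 0x20) in the high byte and
   (column + 0x20) in the low byte. */
unsigned kuten_to_jis(struct kuten k)
{
    return ((unsigned)(k.row + 0x20) << 8) | (unsigned)(k.cell + 0x20);
}
```

Under that assumption, Kuten 1601 becomes 0x3021. Note that a Kuten number with a leading zero must not be written as 0402 in C source, since that would be read as an octal constant; write 402 instead.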

JIS-X-0212 defines a further set of Kanji characters. These characters do not overlap with those in JIS-X-0208 and are less common Kanji. Failure to support JIS-X-0212 is not a tragedy, as all Kanji can easily be represented with Hiragana soundings. Failing to support JIS-X-0208 is a real shortcoming, as these Kanji are in everyday use and it's tedious having to use Hiragana.

Character Encodings

Armed with a thorough understanding of Japanese character set standards it is time to consider how they are actually encoded. Firstly, Kuten codes are not used to represent characters within a computer. Frequently there is a mapping between the Kuten and the encoding scheme, but Kuten is not used to encode characters. One commonly used encoding is Shift-JIS, which was developed by Microsoft. This encodes characters from both JIS-X-0201 and JIS-X-0208. The latter character set is encoded using two byte characters, whilst the former encodes to single byte characters. Shift-JIS is popular for storing data as it is efficient memory-wise. On the negative side it cannot be extended to encode JIS-X-0212, as there is no room to do so within the encoding scheme. A more serious drawback is that if a string of Shift-JIS characters is stored in an array of chars which is pointed to by char *ptr, then it is impossible to determine whether *(ptr+i) is a single byte character or part of a double byte character without parsing the text from the beginning of the string. This makes text manipulation cumbersome.

Unicode provides another character set encoding. One disadvantage of Unicode is its incompatibility with other encoding schemes - it isn't possible to convert between encoding schemes without lookup tables. Another is that its multibyte characters frequently contain zero values - the string terminator in C and C++. I'll say no more beyond noting that the advantages of Unicode (for those supporting many languages) are obvious.
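The Shift-JIS drawback described above can be sketched in C. The function below answers the question "what is *(ptr+i)?" in the only reliable way, by walking the string from its first byte. The names sjis_is_lead() and sjis_kind_at() are hypothetical, and the lead byte ranges (0x81 to 0x9F and 0xE0 to 0xFC) are an assumption taken from the commonly documented Shift-JIS layout; check them against the variant your platform uses. The sketch also assumes well-formed input and does not validate the second byte.

```c
#include <assert.h>
#include <stddef.h>

enum sjis_kind { SJIS_SINGLE, SJIS_LEAD, SJIS_TRAIL };

/* Assumed lead byte ranges for two-byte Shift-JIS characters. */
int sjis_is_lead(unsigned char c)
{
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Classify the byte at index i of the NUL-terminated string ptr.
   The scan always starts at index 0, because a second byte can have
   the same value as an ordinary single byte character. */
enum sjis_kind sjis_kind_at(const char *ptr, size_t i)
{
    size_t pos = 0;

    assert(ptr != NULL);
    while (ptr[pos] != '\0') {
        if (sjis_is_lead((unsigned char)ptr[pos]) && ptr[pos + 1] != '\0') {
            if (pos == i)
                return SJIS_LEAD;
            if (pos + 1 == i)
                return SJIS_TRAIL;
            pos += 2;               /* skip both bytes of the character */
        } else {
            if (pos == i)
                return SJIS_SINGLE;
            pos += 1;
        }
    }
    return SJIS_SINGLE;             /* i is at or beyond the terminator */
}
```

For example, in the two bytes 0x83 0x41 the second byte has the same value as ASCII 'A', yet sjis_kind_at() reports it as SJIS_TRAIL because the scan saw the lead byte before it. Looking at the byte in isolation could not tell the two cases apart.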

The place of C in all this

Those who are au fait with C standards will already know all about wide characters. I must confess that I only learned about them within the last year, and only as a result of purchasing and perusing "Standard C: A Reference" by Plauger & Brodie. As the user interface which I was working on used the Shift-JIS encoding, I was faced with text manipulation problems related to string truncation and wrapping text over several lines (you don't want to cut a string halfway through a double byte character). Wide characters sounded like the solution I was after.
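To illustrate the truncation problem, here is a minimal sketch of the kind of helper that is needed when working on Shift-JIS bytes directly. It is not the routine from the project; sjis_lead_byte() and sjis_safe_cut() are hypothetical names, and the lead byte ranges are the commonly documented ones (0x81 to 0x9F and 0xE0 to 0xFC), assumed rather than taken from the text above.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed lead byte ranges for two-byte Shift-JIS characters. */
int sjis_lead_byte(unsigned char c)
{
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Return the largest byte count not exceeding max_bytes at which the
   NUL-terminated Shift-JIS string s can be cut without splitting a
   double byte character. */
size_t sjis_safe_cut(const char *s, size_t max_bytes)
{
    size_t pos = 0;

    assert(s != NULL);
    while (s[pos] != '\0') {
        size_t width = 1;

        if (sjis_lead_byte((unsigned char)s[pos]) && s[pos + 1] != '\0')
            width = 2;
        if (pos + width > max_bytes)
            break;                  /* next character would not fit */
        pos += width;
    }
    return pos;
}
```

Wrapping text over several lines can be built on the same function: call it with the line width, emit that many bytes, advance the pointer by the returned count and repeat until it returns zero. This assumes a fixed number of bytes per line, which only matches the display width if double byte characters are drawn twice as wide as single byte ones.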

I took some time to read up on what wide character support was provided by the standard, and I'll summarise it briefly here, just to whet your appetite. Conversion from multibyte strings to wide character strings is achieved using mbsrtowcs(). The reverse process is performed by wcsrtombs(). It is possible to convert one character at a time should this be necessary - the standard does not hide the details behind these functions - but for most users this level of control is unnecessary. There are character classification functions for wide characters - the function names are easily remembered - isupper() becomes iswupper(). This naming convention is followed throughout the character classification functions. There are also wide character equivalents for stream handling - fwprintf(), for instance - and for wide string manipulation. Virtually anything that can be done with strings can be done with their wide character brothers, which is how it should be.
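As a rough sketch of how these functions fit together, the wrappers below convert a multibyte string to wide characters and back, and count upper case letters with iswupper(). The wrapper names to_wide(), to_multibyte() and count_upper() are my own, not part of the standard. How the bytes are interpreted depends entirely on the current locale; in the default "C" locale only plain single byte text can be relied upon, so a Shift-JIS string would need a suitable locale to be selected first, if the implementation provides one.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Convert a multibyte string to wide characters using the current locale.
   Returns the number of wide characters stored, not counting the
   terminator, or (size_t)-1 if an invalid sequence is met. */
size_t to_wide(const char *mbs, wchar_t *out, size_t out_len)
{
    mbstate_t state;
    const char *src = mbs;
    size_t n;

    assert(mbs != NULL && out != NULL && out_len > 0);
    memset(&state, 0, sizeof state);        /* initial conversion state */
    n = mbsrtowcs(out, &src, out_len - 1, &state);
    if (n == (size_t)-1)
        return n;
    out[n] = L'\0';                         /* always terminate */
    return n;
}

/* The reverse conversion; returns the number of bytes stored, not
   counting the terminator, or (size_t)-1 on failure. */
size_t to_multibyte(const wchar_t *ws, char *out, size_t out_len)
{
    mbstate_t state;
    const wchar_t *src = ws;
    size_t n;

    assert(ws != NULL && out != NULL && out_len > 0);
    memset(&state, 0, sizeof state);
    n = wcsrtombs(out, &src, out_len - 1, &state);
    if (n == (size_t)-1)
        return n;
    out[n] = '\0';
    return n;
}

/* Count upper case letters with the wide classification function. */
size_t count_upper(const wchar_t *ws)
{
    size_t n = 0;

    for (; *ws != L'\0'; ++ws)
        if (iswupper((wint_t)*ws))
            ++n;
    return n;
}
```

Once the text is held as wchar_t, every element is one whole character, so truncating or wrapping becomes simple array indexing. That is the attraction of wide characters over working on the multibyte form directly.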

Setting the locale using setlocale() controls the behaviour of the wide character string functions. It was here that I ran into difficulties. I was confident that my compiler didn't come with any Kanji wide character routines, and if I had to write them then how did I tie this into the locale so that all the wonderful wide character routines were supported? I searched through the compiler manuals and my C reference manuals, and surfed the web for information, but to no avail. Part of my problem was the target environment - an embedded microprocessor. This means minimal operating system support, particularly in the area of locales and other such niceties. Indeed, the description of setlocale() in the compiler manual was "Sets the current locale". That's it - no more, no less.
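For what it is worth, the standard function in <locale.h> is spelled setlocale(), and the one thing it does promise is a null pointer when a request cannot be honoured. The sketch below uses that to find out whether a locale is actually available. The wrapper name select_ctype_locale() is hypothetical, and which locale names exist, if any beyond "C", is entirely up to the implementation.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>

/* Try to select the named locale for character handling (LC_CTYPE is
   the category that governs multibyte and wide character conversion).
   If the request fails, fall back to the "C" locale, which every
   conforming implementation must provide.
   Returns the name of the locale now in effect. */
const char *select_ctype_locale(const char *name)
{
    const char *active;

    assert(name != NULL);
    active = setlocale(LC_CTYPE, name);
    if (active == NULL) {
        /* Request refused: the locale is unchanged, so set "C" explicitly. */
        active = setlocale(LC_CTYPE, "C");
    }
    return active;
}
```

A call such as select_ctype_locale("ja_JP.SJIS") would be the natural thing to try for Shift-JIS, but that name is only an example of the form some systems use and is an assumption here. On a minimal embedded library the call may well fail and leave you in the "C" locale, which is consistent with the difficulties described above.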

I don't really understand locales, and I don't have a great deal of time to find out more. It isn't obvious where to look. The project has moved on and the locale solution that looks very nice and elegant has been passed over in favour of a library of Shift-JIS functions. There are one or two gurus who know what the functions do, and everyone else hopes they'll never be called upon to manipulate Shift-JIS strings. I'm left thinking that it could have been really good, but we were defeated by a lack of both knowledge and time. If anyone does have the time and knowledge to produce a brief article filling in the gaps in my knowledge then I'd certainly appreciate it.

Well, how about it?
