    <rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:admin="http://webns.net/mvcb/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:content="http://purl.org/rss/1.0/modules/content/">
     <channel>
        <title>ACCU  :: Close Encounters with Internationalisation</title>
        <link>https://members.accu.org/index.php/articles/747</link>
        <description>Professionalism in Programming</description>
        <dc:language>en-us</dc:language> 
        <dc:creator>Administrator</dc:creator> 
        <admin:generatorAgent rdf:resource="http://www.xaraya.org" /> 
        <admin:errorReportsTo rdf:resource="mailto:webeditor@accu.org" />
       <sy:updatePeriod>hourly</sy:updatePeriod>
       <sy:updateFrequency>1</sy:updateFrequency>
       <docs>http://backend.userland.com/rss</docs>




<div class="xar-mod-head"><span class="xar-mod-title">Programming Topics + CVu Journal Vol 11, #1 - Nov 1998</span></div>





<div class="xar-norm xar-standard-box-padding">
   <h1><strong>Title:</strong>&nbsp;Close Encounters with Internationalisation</h1>
<p>
<strong>Date:</strong> Tue, 03 November 1998 13:15:28 +00:00</p>
<p><strong>Body:</strong>&nbsp;<div class="sect1" lang="en">
<div class="titlepage">
<h2><a name="d0e20" id="d0e20"></a></h2>
</div>
<p>I have recently been involved in a project developing a user
interface for some equipment that was aimed at the Japanese market.
This is the first time I've been involved in any project which
supports more than one language, and it was made all the more
interesting in that the language is not based on the
Roman alphabet which those of us who live in the West know so
well. I had to do some background work in order to get an
understanding of the complexities of character sets and after
trawling through much documentation, both paper and web based, I
decided that it would be profitable to put down what I'd learned on
paper. My main objective in doing so was consolidating what I'd
learned in a form that would be useful for future reference.
However, submitting it to C Vu makes my efforts far more worthwhile
as my experience may prove useful to others, and hopefully the
questions which I throw up will generate some response.</p>
</div>
<div class="sect1" lang="en">
<div class="titlepage">
<h2><a name="d0e24" id="d0e24"></a>Japanese
Character Sets</h2>
</div>
<p>Before I launch into the ins and outs of character sets, let me
set the scene with some background information on the Japanese
writing system. The Japanese have, for hundreds of years, been
borrowing characters from the Chinese to supplement their alphabet.
As a result of this their main alphabet - Kanji, which literally
translates as &quot;Chinese character&quot; - has thousands of characters. As
learning enough to read any major work of literature is a daunting
task, the Japanese supplement Kanji with two phonetic alphabets -
Hiragana and Katakana. Hiragana is used to spell out native
Japanese words phonetically, and to spell out less common written words,
thereby doing away with the requirement of a formidable education
and good memory. Katakana is used to spell out foreign words. The
sounds it represents are pretty much the same as those represented by
Hiragana; it's just that different alphabets are used for native and
foreign words.</p>
<div class="sect2" lang="en">
<div class="titlepage">
<h3><a name="d0e29" id="d0e29"></a>Character
Representation &amp; Encoding</h3>
</div>
<p>The task of representing Kanji within computers, together with
Kanji I/O operations, is one which has taken up some effort over
the last few decades, and one which many software engineers in the
West have been largely ignorant of. As with most software engineers
in the West, I'd taken ASCII for granted and never thought about it
too much until I was faced with alphabets that were radically
different to the Roman alphabet. I found that I had to clarify my
ideas on character set representation and encoding. Firstly (and it
seems as though I'm stating the obvious here) the set of characters
available is known as the character set. The character set makes no
reference to how the characters are actually encoded, it only
defines which characters are available. The encoding scheme assigns
a numeric representation to each character. A sensible encoding
scheme orders the characters in some useful way, as ASCII does.
Ultimately the encoding is a trade off between useful ordering and
efficient implementation.</p>
<p>The Japanese have three main character set standards, and they
are cryptically referred to as JIS-X-0201, JIS-X-0208 and
JIS-X-0212. Strictly speaking I should suffix the numbers with a
four-digit year specifying which version of the standard I'm
referring to, but let's keep things simple for now. JIS stands for
Japanese Industrial Standards. JIS-X-0201 is very similar to
ASCII in the first 128 characters, the only differences being that
the tilde character is replaced by an overbar and the backslash by
a Yen sign. The `top half' of the character set contains Katakana
characters. All characters in JIS-X-0201 can be represented in an
8*8 grid - Katakana is good that way.</p>
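<p>As a rough illustration of the layout just described, the sketch
below sorts a single byte of 8-bit JIS-X-0201 text into broad classes.
The enum and function names are hypothetical, and the Katakana range
0xA1-0xDF is the commonly documented one rather than something taken
from the project described here.</p>

```c
/* Rough classes for one byte of 8-bit JIS-X-0201 text.
 * All names here are hypothetical, chosen for illustration only. */
enum jisx0201_class {
    JISX0201_CONTROL,   /* 0x00-0x1F and 0x7F */
    JISX0201_ROMAN,     /* 0x20-0x7E: ASCII-like, but 0x5C is a Yen sign
                           and 0x7E is an overbar */
    JISX0201_KATAKANA,  /* 0xA1-0xDF: single-byte Katakana and punctuation */
    JISX0201_UNDEFINED  /* everything else */
};

enum jisx0201_class jisx0201_classify(unsigned char byte)
{
    if (byte < 0x20 || byte == 0x7F)
        return JISX0201_CONTROL;
    if (byte <= 0x7E)
        return JISX0201_ROMAN;
    if (byte >= 0xA1 && byte <= 0xDF)
        return JISX0201_KATAKANA;
    return JISX0201_UNDEFINED;
}
```

<p>For instance, <tt class="literal">jisx0201_classify(0x5C)</tt>
reports the Roman class even though a JIS-X-0201 display shows that
byte as a Yen sign rather than a backslash.</p>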
<p>JIS-X-0208 defines the main set of Kanji used. This contains
most of the Kanji in everyday use. The character set also contains
the Hiragana characters and the Katakana characters, together with a
generous selection of non-Japanese characters including Roman,
Greek and Cyrillic. Although the character set standard does not
define any encoding scheme, all characters are assigned reference
numbers. These are based on their position in the table which the
standard provides, and have the format XXYY. They are known as
Kuten codes - translated, Kuten means &quot;row, column&quot;.</p>
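<p>To make the XXYY format concrete, here is a minimal sketch that
validates a Kuten pair and packs it into the two-byte JIS code usually
quoted alongside it. It assumes the customary convention of a 94 by 94
table with 0x20 added to both the row and the column; the function
names are hypothetical.</p>

```c
#include <assert.h>

/* Rows (ku) and columns (ten) both run from 1 to 94 in the table.
 * Function names are hypothetical, for illustration only. */
int kuten_is_valid(int ku, int ten)
{
    return ku >= 1 && ku <= 94 && ten >= 1 && ten <= 94;
}

/* Pack a Kuten pair into a 16-bit JIS code by adding 0x20 to each half.
 * This is the assumed, customary mapping; it is not Shift-JIS. */
unsigned int kuten_to_jis(int ku, int ten)
{
    assert(kuten_is_valid(ku, ten));
    return ((unsigned int)(ku + 0x20) << 8) | (unsigned int)(ten + 0x20);
}
```

<p>Under that assumption the first Kanji in the table, at Kuten 16-01,
comes out as 0x3021. Shift-JIS applies a further arithmetic
transformation on top of this, which is why the Kuten code itself never
appears in stored text.</p>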
<p>JIS-X-0212 defines a further set of Kanji characters. These
characters do not overlap with those in JIS-X-0208 and are less
common Kanji. Failure to support JIS-X-0212 is not a tragedy, as all
Kanji can be easily represented by their Hiragana spellings. Failure to
support JIS-X-0208 is a real shortcoming, as these Kanji are in everyday
use and it's tedious having to use Hiragana.</p>
</div>
<div class="sect2" lang="en">
<div class="titlepage">
<h3><a name="d0e40" id="d0e40"></a>Character
Encodings</h3>
</div>
<p>Armed with a thorough understanding of Japanese character set
standards it is time to consider how they are actually encoded.
Firstly, Kuten codes are not used to represent characters within a
computer. Frequently there is a mapping between the Kuten and the
encoding scheme, but Kuten is not used to encode characters. One
commonly used encoding is Shift-JIS, which was developed by ASCII
Corporation in conjunction with Microsoft. This encodes characters
from both JIS-X-0201 and
JIS-X-0208. The latter character set is encoded using two byte
characters, whilst the former encodes to single byte characters.
Shift-JIS is popular for storing data as it is efficient
memory-wise. On the negative side it cannot be extended to encode
JIS-X-0212 as there is no room to do so within the encoding scheme.
A more serious drawback is that if a string of Shift-JIS characters
is stored in an array of <tt class="type">char</tt>s which is
pointed to by <tt class="varname">char *ptr</tt>, then it is
impossible to determine whether <tt class="literal">*(ptr+i)</tt>
is a single byte character or part of a double byte character
without parsing the text from the beginning of the string. This
makes text manipulation cumbersome. Unicode provides another
character set encoding. One disadvantage of Unicode is its
incompatibility with other encoding schemes - it isn't possible to
convert between encoding schemes without lookup tables. Another one
is that its multibyte characters frequently contain zero-valued bytes -
the string terminator in C and C++. I'll say no more beyond noting
that the advantages of Unicode (for those supporting many
languages) are obvious.</p>
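<p>The Shift-JIS ambiguity can be sketched as follows. The only
reliable way to find out what <tt class="literal">*(ptr+i)</tt> is, in
general, is to walk the string from the start and keep track of
character boundaries. The names below are hypothetical, and the lead
byte ranges (0x81-0x9F and 0xE0-0xFC) are the commonly documented ones,
with the top of the second range belonging to vendor extensions.</p>

```c
#include <assert.h>
#include <stddef.h>

enum sjis_role { SJIS_SINGLE, SJIS_LEAD, SJIS_TRAIL };

/* Assumed lead-byte ranges for two-byte Shift-JIS characters. */
static int sjis_is_lead(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Report the role of byte i by scanning from the start of the buffer.
 * A lead byte with nothing after it is treated as a single byte. */
enum sjis_role sjis_role_at(const char *ptr, size_t len, size_t i)
{
    size_t pos = 0;

    assert(ptr != NULL && i < len);
    while (pos < len) {
        if (sjis_is_lead((unsigned char)ptr[pos]) && pos + 1 < len) {
            if (i == pos)
                return SJIS_LEAD;
            if (i == pos + 1)
                return SJIS_TRAIL;
            pos += 2;           /* skip the whole double byte character */
        } else {
            if (i == pos)
                return SJIS_SINGLE;
            pos += 1;
        }
    }
    return SJIS_SINGLE;         /* not reached when i is less than len */
}
```

<p>The awkward case is a trail byte such as 0x41, which is also the
code for the letter A. In the byte sequence 0x41 0x83 0x41 the first
0x41 is a single byte character and the second is the tail of a
Katakana character, and nothing about the byte itself tells them
apart.</p>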
</div>
<div class="sect2" lang="en">
<div class="titlepage">
<h3><a name="d0e54" id="d0e54"></a>The place of C in
all this</h3>
</div>
<p>Those who are au fait with C standards will already know all
about wide-characters. I must confess that I only learned about
them within the last year, and only as a result of purchasing and
perusing &quot;<i class="citetitle">Standard C: A Reference</i>&quot; by
Plauger &amp; Brodie. As the user interface which I was working on
used the Shift-JIS encoding, I was faced with text
manipulation problems related to string truncation and wrapping
text over several lines (you don't want to cut a string halfway
through a double byte character). Wide characters sounded like the
solution I was after.</p>
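<p>The truncation problem can be handled directly on the Shift-JIS
bytes, at the cost of a scan. The sketch below is not the library the
project ended up with; it is a hypothetical helper that returns the
longest prefix length, no greater than a byte budget, that does not
split a double byte character. The lead byte ranges are the commonly
documented ones and are an assumption.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Assumed lead-byte ranges for two-byte Shift-JIS characters. */
static int sjis_lead_byte(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Largest length <= max_bytes at which s (len bytes long) can be cut
 * without leaving half of a double byte character behind. */
size_t sjis_safe_prefix(const char *s, size_t len, size_t max_bytes)
{
    size_t pos = 0;

    assert(s != NULL);
    while (pos < len) {
        /* A dangling lead byte at the very end counts as one byte. */
        size_t width =
            (sjis_lead_byte((unsigned char)s[pos]) && pos + 1 < len) ? 2 : 1;

        if (pos + width > max_bytes)
            break;              /* the next character would not fit */
        pos += width;
    }
    return pos;
}
```

<p>The same loop serves for line wrapping: call it repeatedly on the
remainder of the string with the line width as the budget. Note that it
counts bytes, not display columns, so a real wrapper would also need to
know how wide each character is on screen.</p>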
<p>I took some time to read up on what wide character support was
provided by the standard, and I'll summarise it briefly here, just
to whet your appetite. Conversion from multibyte strings to wide
character strings is achieved using <tt class=
"function">mbsrtowcs()</tt>. The reverse process is performed by
<tt class="function">wcsrtombs()</tt>. It is possible to convert
one character at a time should this be necessary - the standard
does not hide the details behind these functions - but for most
users this level of control is unnecessary. There are character
classification functions for wide characters - the function names
are easily remembered - <tt class="function">isupper()</tt> becomes
<tt class="function">iswupper()</tt>. This naming convention is
followed throughout the character classification functions. There
are also wide character equivalents for stream handling -
<tt class="function">fwprintf()</tt>, for instance - and for wide
string manipulation. Virtually anything that can be done with
strings can be done with their wide character brothers, which is
how it should be.</p>
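<p>A minimal sketch of how these pieces fit together: convert a
multibyte string with <tt class="function">mbsrtowcs()</tt>, then
apply a wide character classification function to each element. The
function name and the fixed 64 element buffer are arbitrary choices for
illustration. The conversion follows whatever locale is current, which
is the "C" locale unless the program has changed it.</p>

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

#define WIDE_BUF_LEN 64   /* arbitrary size for this sketch */

/* Convert mbs to wide characters under the current locale and count
 * the uppercase ones. Returns (size_t)-1 if the input is not valid in
 * the locale's encoding or does not fit in the buffer. */
size_t count_upper_mb(const char *mbs)
{
    wchar_t wide[WIDE_BUF_LEN];
    mbstate_t state;
    const char *src = mbs;
    size_t n, i, upper = 0;

    assert(mbs != NULL);
    memset(&state, 0, sizeof state);        /* initial conversion state */

    n = mbsrtowcs(wide, &src, WIDE_BUF_LEN, &state);
    if (n == (size_t)-1 || src != NULL)     /* bad sequence, or too long */
        return (size_t)-1;

    for (i = 0; i < n; i++)
        if (iswupper((wint_t)wide[i]))
            upper++;
    return upper;
}
```

<p>Once the text is in a <tt class="type">wchar_t</tt> array every
element is one whole character, so indexing, truncation and
classification no longer need to know anything about lead and trail
bytes.</p>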
<p>Setting the locale using <tt class="function">setlocale()</tt>
controls the behaviour of the multibyte conversion and wide character
classification functions. It was
here that I ran into difficulties. I was confident that my compiler
didn't come with any Kanji wide character routines, and if I had to
write them then how did I tie this into the locale so that all the
wonderful wide character routines were supported? I searched
through the compiler manuals, my C reference manuals and surfed the
web for information but to no avail. Part of my problem was the
target environment - an embedded microprocessor. This means minimal
operating system support, particularly in the area of locales and
other such niceties. Indeed, the description of <tt class=
"function">setlocale()</tt> in the compiler manual was &quot;Sets the
current locale&quot;. That's it - no more, no less.</p>
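<p>For what it is worth, the standard does give one way of asking an
implementation what it supports: <tt class="function">setlocale()</tt>
returns a null pointer when it cannot honour a request, and
<tt class="literal">MB_CUR_MAX</tt> reports the longest multibyte
character in the locale that is in force. The wrapper below is a
hypothetical sketch of that check, not a description of what any
particular embedded compiler provides.</p>

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <stdlib.h>

/* Try to select a locale for character handling only (LC_CTYPE).
 * Returns the maximum number of bytes per multibyte character in the
 * new locale, or 0 if the library rejects the name, in which case the
 * locale is left unchanged. The function name is hypothetical. */
size_t select_ctype_locale(const char *name)
{
    assert(name != NULL);
    if (setlocale(LC_CTYPE, name) == NULL)
        return 0;
    return (size_t)MB_CUR_MAX;
}
```

<p>The name to pass for a Shift-JIS locale is not defined by the C
standard. Spellings such as "ja_JP.SJIS" are seen on some Unix systems,
but that is an assumption to check against the platform's own
documentation. A result of 1 suggests the library only handles single
byte text in that locale, so the wide character routines would not help
with Kanji there.</p>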
<p>I don't really understand locales, and I don't have a great deal
of time to find out more. It isn't obvious where to look. The
project has moved on and the locale solution that looks very nice
and elegant has been passed over in favour of a library of
Shift-JIS functions. There are one or two gurus who know what the
functions do, and everyone else hopes they'll never be called upon
to manipulate Shift-JIS strings. I'm left thinking that it could
have been really good, but we were defeated by a lack of both
knowledge and time. If anyone does have the time and knowledge to
produce a brief article filling in the gaps in my knowledge then
I'd certainly appreciate it.</p>
<p class="c2"><span class="remark">Well how about it?</span></p>
</div>
</div>
</p>
</div>
</channel>
</rss>
