    <rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:admin="http://webns.net/mvcb/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:content="http://purl.org/rss/1.0/modules/content/">
     <channel>
        <title>ACCU  :: A Short History of Character Sets</title>
        <link>https://members.accu.org/index.php/journals/1168</link>
        <description>Professionalism in Programming</description>
        <dc:language>en-us</dc:language> 
        <dc:creator>Administrator</dc:creator> 
        <admin:generatorAgent rdf:resource="http://www.xaraya.org" /> 
        <admin:errorReportsTo rdf:resource="mailto:webeditor@accu.org" />
       <sy:updatePeriod>hourly</sy:updatePeriod>
       <sy:updateFrequency>1</sy:updateFrequency>
       <docs>http://backend.userland.com/rss</docs>


        <h2>Journal Articles</h2>


<div class="xar-mod-head"><span class="xar-mod-title">CVu Journal Vol 14, #3 - Jun 2002 + Design of applications and programs</span></div>





<div class="xar-norm xar-standard-box-padding">
   <h1><strong>Title:</strong>&nbsp;A Short History of Character Sets</h1>
<p><strong>Author:</strong>&nbsp;</p>
<p>
<strong>Date:</strong> Mon, 03 June 2002 13:15:51 +01:00</p>
<p><strong>Summary:</strong>&nbsp;<p>In this article I will provide some background to character sets and character encodings.</p></p>
<p><strong>Body:</strong>&nbsp;<div class="sect1" lang="en">
<div class="titlepage">
<h2><a name="d0e20" id="d0e20"></a></h2>
</div>
<p>It is impossible to work with XML and not come across the
subject of character encodings. If XML is the markup language for a
document and characters are the atoms that make up the document,
then XML will need to have intimate knowledge of how the document
is encoded, to understand what a character is in this document.</p>
<p>And given the multitude of platforms, operating systems and
serialisation formats, that is no simple task. The design of the
Universal Character Set (or Unicode) was an attempt to standardise
how a character was represented in a computer and is thus an
important part of making XML a standard that is not dependent on
any underlying implementation. The various Unicode Transformation
Formats (UTF) are a way of standardising how the UCS is encoded in
a serial format.</p>
<p>In this article I will provide some background to character sets
and character encodings. The focus is on what is needed to work
with XML parsers, as a preliminary to further articles in the
series. For this reason there are some areas (glyphs and
representation for example) that have not been covered.</p>
</div>
<div class="sect1" lang="en">
<div class="titlepage">
<h2><a name="d0e28" id="d0e28"></a>A Little
History...</h2>
</div>
<p>In the beginning were the dot and the dash... probably the
earliest form of character set and encoding was that used by Morse
for the telegraph, back in 1844, when he sent the famous first
message from Washington to Baltimore (&quot;WHAT HATH GOD WROUGHT&quot;).
Although not related to any base 2 encoding, Morse code was the
first attempt to represent alphabetic characters as a series of
bits (or, in this case, dots and dashes)<sup>[<a name="d0e33" href=
"#ftn.d0e33" id="d0e33">1</a>]</sup>. Morse code was a varying
length code, using one bit for the common characters &quot;E&quot; and &quot;T&quot;
and up to 6 bits for some punctuation characters. And, yes, I am
using the term &quot;bit&quot; rather loosely here.</p>
<p>Morse code worked well for human operators, but for mechanical
processing, a fixed length code would be a great improvement and in
1874 Baudot came up with a fixed length, 5-bit code to represent
characters. By defining a &quot;shift-in&quot; key, he managed to get about
60 characters/numbers out of the coding. The mechanics of reading
and writing this code were handled by a horrendously complicated
piece of apparatus, the &quot;keyboard&quot; being operated by five fingers
(two from the left hand and three from the right) and resembling a
very short piano.</p>
<p>Around the turn of the century a New Zealander, Donald Murray,
developed something that more closely resembles a typewriter, using
codes based on the Baudot set. The main criterion for the layout of
the codes was that common characters should create the least amount
of mechanical movement, so the letter &quot;E&quot; had the value 1 (followed
by &quot;A&quot;, &quot;S&quot; and &quot;U&quot;). The Western Union Telegraph Company licensed
the technology from Murray, and with a few changes to the code, it
was to remain as it was into the 1950s. In the 1930s the CCITT (the
international telegraph standards committee) took the Baudot/Murray
code and used it as the basis for the ITA2 standard
(&quot;International Telegraph Alphabet No. 2&quot;; I have no idea
what happened to No. 1).</p>
<p>So far none of the codes make any use of lower case characters
or of formatting codes, although ITA2 did have codes for CR and LF.
It was left to the U.S. military to come up with a larger code set
that would contain the full set of upper and lower case English
characters, with numerals, punctuation and a set of control
characters. It was known as FIELDATA and can be seen as the
precursor to the ASCII set, the alphabetic characters being in
sorted order (&quot;a&quot; - &quot;z&quot;) and the numerals in numeric order. It was
a 6-bit code (the standard size of a character in those days).</p>
<p>In June 1963, on the basis of the FIELDATA codes, the American
Standards Association (ASA, in reality IBM and AT&amp;T) created the
ASCII-63 standard (American Standard Code for Information
Interchange). ASCII-63 is almost recognizable to us: the control
codes are all below 0x20, space is 0x20 and then follow the
numbers, punctuation and upper case characters (with &quot;A&quot; as 0x41).
The only glaring omission in ASCII-63 is that there are no lower
case characters!</p>
<p>In October 1963 the ISO standards body decided that the world
needed lower case characters, so these were added, some minor
changes were made to the punctuation characters, and the result
was released as ECMA-6. In 1967 the ASA adopted ECMA-6 and
released it as ASCII-1967, a 7-bit code containing 128 character
codes that has remained in use until today.</p>
<p>Apart from some accented characters in the original Baudot code,
all the above codes contain only the standard English characters.
At first other countries started replacing some of the
characters/punctuation with their own national characters and
registering these changes as 7-bit ISO code sets. Unfortunately,
this caused total incompatibility between countries that wanted to
exchange data and so ISO extended the ASCII data set to be 8 bits
long, thereby doubling its size. The original 128 7-bit codes were
kept as before, and countries were able to utilize the other 128
codes for their national character sets. This resulted in an
explosion of differing code sets, mainly of the ISO 8859-n type,
but also including Shift-JIS, ISO-2022-JP, and EUC-JP for example.
IBM trotted off to do its own thing by inventing EBCDIC, also an
8-bit codeset.</p>
<p>Clearly, the stage was set for another upgrade of the systems
that we use for representing characters in a computer. In brief, a
larger set of characters existed than could be represented by one
homogenous 8-bit code set. A 16-bit character set was needed.</p>
<p>Time for a standards committee! Or two in fact, the Unicode
Consortium and ISO/IEC. Fortunately for the sanity of programmers
everywhere, these two bodies have decided to cooperate and
effectively the two standards are interchangeable. The Unicode
Consortium (a special interest group of US manufacturers) was first
off the ground in defining a 16-bit character set, Unicode V 1.0,
first published in 1991, followed in 1993 by V 1.1. The ISO/IEC had
been creating
something completely different, but with V1.1 of the Unicode
standard, they adopted that and it became ISO/IEC 10646 Universal
Multiple-Octet Coded Character Set (normally abbreviated to UCS).
Since then the ISO/IEC standards and the Unicode standards have
remained in step, the main difference being that Unicode is a 16
bit subset of the ISO/IEC 10646 32 bit character set, but for
practical purposes they are interchangeable.</p>
<p>Unicode V 2.0 arrived in 1996, followed by V 2.1 in 1998, V 3.0
in 1999 and V 3.1 in 2001.</p>
<p>The remainder of this article will examine Unicode / ISO/IEC
10646-1 in further detail. I will refer to Unicode to cover both
from now on, unless there are some differences to be pointed out.
Technically, Unicode is a subset of a vastly larger set of codes
covered by 10646.</p>
</div>
<div class="sect1" lang="en">
<div class="titlepage">
<h2><a name="d0e57" id="d0e57"></a>Some
definitions</h2>
</div>
<p>Before continuing, it would be good to define some terms that we
will be using:</p>
<div class="variablelist">
<dl>
<dt><span class="term">character</span></dt>
<dd>
<p>goes beyond the usual sense of an alphabetic character used to
create words: a character is any atomic component with semantic
significance, thus including numbers and punctuation<sup>[<a name=
"d0e69" href="#ftn.d0e69" id="d0e69">2</a>]</sup>.</p>
</dd>
<dt><span class="term">character set</span></dt>
<dd>
<p>is a set of such characters that can be used together to create
words and sentences in a particular language. For example, the
Latin character set, or the Cyrillic character set.</p>
</dd>
<dt><span class="term">a coded character set</span></dt>
<dd>
<p>is a character set and its associated (numeric) codes. For
instance, ASCII defines a coded character set, where the Roman
letter &quot;a&quot; is represented by the number 97, &quot;b&quot; by 98.</p>
</dd>
<dt><span class="term">a code point</span></dt>
<dd>
<p>is a character code within a character set. For example, the
code point for &quot;A&quot; is 0x0041 (dec 65) in ASCII and in Unicode.</p>
</dd>
<dt><span class="term">an encoding</span></dt>
<dd>
<p>is a serialised form of a coded character set, as used for files
or strings. An encoding maps a character onto one or more bytes.
Examples of encoding schemes are UTF-8, Cp1252, ISO-8859-1 and GBK
(Simplified Chinese).</p>
</dd>
<dt><span class="term">UCS (Universal Character Set)</span></dt>
<dd>
<p>is a term commonly used in XML to describe both Unicode and the
ISO/IEC 10646 character systems.</p>
</dd>
<dt><span class="term">A script</span></dt>
<dd>
<p>is the set of characters needed to write a set of languages, such
as the Latin script used for most European languages, or the
Devanagari script used for several Indian languages. Some languages,
such as Japanese, use more than one script.</p>
</dd>
</dl>
</div>
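<p>The difference between a code point and an encoding is easy to
see in a few lines of code. The following is a minimal Python sketch
and not anything taken from a parser; the helper name
<tt class="literal">describe</tt> is made up for this illustration.</p>

```python
def describe(ch):
    """Return the code point of one character and its bytes in three encodings."""
    return {
        "code_point": ord(ch),                    # the number in the coded character set
        "utf-8": ch.encode("utf-8"),              # one to four bytes
        "utf-16-be": ch.encode("utf-16-be"),      # two or four bytes, high byte first
        "iso-8859-1": ch.encode("iso-8859-1"),    # one byte, Latin-1 characters only
    }

# "A" and the o-umlaut (written as an escape to keep this source pure ASCII)
for ch in ("A", "\u00f6"):
    info = describe(ch)
    print(hex(info["code_point"]), info["utf-8"], info["utf-16-be"], info["iso-8859-1"])
```

<p>The code point is the same whichever encoding is chosen; only the
bytes that represent it change.</p>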
</div>
<div class="sect1" lang="en">
<div class="titlepage">
<h2><a name="d0e109" id="d0e109"></a>Universal
Character Set</h2>
</div>
<p>Once we enter into the world of the Universal Character Set we
enter a world of considerable complexity, a complexity that comes
from the sheer number of characters that need to be represented and
also from the need for compatibility with existing standards.</p>
<p>The ISO/IEC 10646 standard proposes both 16 bit and 32 bit
representations of the world's character systems, whereas Unicode
(before V3.1) is a 16-bit representation and as such is a subset of
the full 10646 standard. The Unicode set coincides with the lowest
plane of the 10646 standard, the one that has the first two octets
set to zero. This is often called the Basic Multilingual Plane (BMP);
10646 is often thought of as a set of planes, with 128 groups of
256 planes each, of which Unicode is identical to the BMP (Plane 0 in
Group 0).</p>
<p>With Unicode V3.1, character codes were added outside of the BMP
(that is, with a value greater than 0xFFFF), marking the move from
a 16 bit system to a 32 bit system. See the next section for
further details; in this section we will restrict ourselves to
Unicode V3.0.</p>
<p>The Unicode character set can be divided up into the four zones
itemised in Table 1.</p>
<div class="c2"><img src="/var/uploads/journals/resources/table1.png" align=
"middle"></div>
<p>A closer look at the A-Zone gives us this (partial) table of
code values:</p>
<div class="c2"><img src="/var/uploads/journals/resources/a-zone.png" align=
"middle"></div>
<p>The table continues in similar fashion for all the other
alphabets, each script/alphabet having its own section of the code.
As can be seen, the size of the block for each language varies as
necessary.</p>
<p>All the code blocks mentioned so far have mapped onto various
international character sets, but there are some codes in the zones
above (zone table) that don't. In particular, the surrogate pairs
(in Zone O) and the private use area (in Zone R) do not, directly,
contain any characters.</p>
<p>The surrogate pair codes are important but currently not in
widespread use. A standard, 16-bit code point can access 65,536
different characters in theory, and when it was realized that this
was not enough, a set of code points, called the <i class=
"firstterm">surrogate pairs</i>, was created. There are two sets,
the <i class="firstterm">high surrogates</i>, from 0xD800 - 0xDBFF,
and the <i class="firstterm">low surrogates</i>, from 0xDC00 -
0xDFFF. High surrogate values between 0xDB80 and 0xDBFF are reserved
for private use. As the name implies, the surrogates come in
pairs (a high surrogate followed by a low surrogate), but each pair
is treated as a single code point that maps to the
range 0x10000 to 0x10FFFF (<span class="emphasis"><em>the
supplementary code points</em></span>). How this works in practice
will become clear when we discuss character encoding schemes.</p>
<p>The mapping is done using the following formulas:</p>
<p>(S = Supplementary, H = High surrogate, L = Low surrogate)</p>
<div class="literallayout">
<p><tt class="literal">S = (H - 0xD800) * 0x0400 + (L-0xDC00) +
0x10000 <br>
H = (S - 0x10000) / 0x0400 + 0xD800<br>
L = (S - 0x10000)  mod 0x0400 + 0xDC00</tt></p>
</div>
<p>For example, the Old Italic Numeral Five (looks like an inverted
'V') has code point 0x10321, which would give the surrogate
pair 0xD800 (High) and 0xDF21 (Low).</p>
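<p>The three formulas above translate almost directly into code. This
is a small Python sketch of them; the function and constant names are
our own, and the division in the formula for H is taken to be integer
division.</p>

```python
HIGH_START = 0xD800   # first high surrogate
LOW_START = 0xDC00    # first low surrogate
SUPP_START = 0x10000  # first supplementary code point

def to_surrogates(s):
    """Split a supplementary code point S into its (H, L) surrogate pair."""
    if not 0x10000 <= s <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = s - SUPP_START
    return HIGH_START + offset // 0x0400, LOW_START + offset % 0x0400

def from_surrogates(h, l):
    """Combine a high and a low surrogate into the code point S."""
    if not (0xD800 <= h <= 0xDBFF and 0xDC00 <= l <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return (h - HIGH_START) * 0x0400 + (l - LOW_START) + SUPP_START

high, low = to_surrogates(0x10321)
print(hex(high), hex(low))   # prints 0xd800 0xdf21
```

<p>Running the pair back through
<tt class="literal">from_surrogates</tt> returns the original code
point, which is a handy sanity check on the arithmetic.</p>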
<p>From this it can be seen that the surrogate pairs add over one
million more characters to the Unicode code set, all above 0x10000.
Typically though, even East Asian texts contain less than 1% of
their characters as surrogate pairs. Windows XP supports surrogate
pairs and Java 1.4 will also.</p>
</div>
<div class="sect1" lang="en">
<div class="titlepage">
<h2><a name="d0e153" id="d0e153"></a>Unicode V3.1
additions</h2>
</div>
<p>As mentioned previously, Unicode V3.1 is the first of the
Unicodes to describe characters of more than 16 bits, using the
surrogate pairs described above. In fact, it includes an additional
44,946 encoded characters!</p>
<p>These characters are encoded outside of the BMP (with code
points of 0x10000 and above), as follows:</p>
<div class="itemizedlist">
<ul type="disc">
<li>
<p>Supplementary Multilingual Plane (SMP) - 0x10000...0x1FFFF</p>
</li>
<li>
<p>Supplementary Ideographic Plane (SIP) - 0x20000...0x2FFFF</p>
</li>
<li>
<p>Supplementary Special Purpose Plane (SSP) -
0xE0000...0xEFFFF</p>
</li>
</ul>
</div>
<p>SMP contains some historic scripts and more symbols, mainly
mathematical and musical.</p>
<p>SIP contains a very large collection of Han ideographs.</p>
<p>SSP contains a set of tag characters.</p>
<p>To put things in a kind of perspective, Unicode V3.1 describes
94,140 encoded characters, of which 70,207 are Han ideographs.</p>
</div>
<div class="sect1" lang="en">
<div class="titlepage">
<h2><a name="d0e178" id="d0e178"></a>Character
Encoding</h2>
</div>
<p>Given a 16 bit character set, the simplest way to store its
characters on a disk or send them down the wire would be as 16 bit
values. This straightforward method of encoding is called UTF-16 (for
Unicode Transformation Format). Each character code 0xFFFF and below
is stored as a single 16 bit value. Values of 0x10000 and above are
represented using the surrogate pairs. Here is a typical
representation of the string &quot;Der L&ouml;wen&quot;:</p>
<div class="literallayout">
<p><tt class="literal"> D     e     r           L     &ouml;     w     e     n<br>
&quot;44 00 65 00 72 00 20 00 4C 00 F6 00 77 00 65 00 6E 00&quot;</tt></p>
</div>
<p>These are all 'normal' Latin characters, except for the o-umlaut,
which has the code point 0xF6, greater than the maximum 7-bit ASCII
character value 0x7F. The bytes are shown in LittleEndian order, with
the low byte of each 16 bit value first.</p>
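<p>The byte listing can be reproduced with Python's built-in codecs.
This is only a sketch to show where the bytes come from;
<tt class="literal">hex_bytes</tt> is a helper name of our own.</p>

```python
def hex_bytes(data):
    """Format bytes as space-separated upper case hex pairs."""
    return " ".join(f"{b:02X}" for b in data)

text = "Der L\u00f6wen"   # \u00f6 is the o-umlaut

# Low byte first (the order shown in the listing) and high byte first.
little = hex_bytes(text.encode("utf-16-le"))
big = hex_bytes(text.encode("utf-16-be"))

print(little)
print(big)
```

<p>The "utf-16-le" and "utf-16-be" codecs write no byte order mark, so
the output contains only the nine characters, two bytes each.</p>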
<p>Clearly, using a 16-bit character set and encoding is an
excellent way to store all the world's languages, and a few
non-languages as well. It has two major drawbacks though. Firstly,
using 16 bits per character instead of the earlier 8 bits will
double the size of a text file, and given that 90% of the text in
the world (at least on the Internet) can be easily handled with
8 bits, it would seem a bit wasteful to double the size of all
files. Secondly, old legacy files cannot work with a 16-bit
application unless converted to 16 bit Unicode.</p>
<p>For this reason another character encoding is also defined,
UTF-8.</p>
<p>UTF-8 uses a sequence of 8 bit values to store each Unicode
character. All characters up to and including 0x007F are stored in a
single 8 bit value, characters between 0x0080 and 0x07FF are stored
in 16 bits (two bytes), those between 0x0800 and 0xFFFF are stored in
24 bits (three bytes) and those of 0x10000 and above are stored in
32 bits (four bytes). See the table below for finer details.</p>
<p>UTF-8 encoding solves both the problems mentioned above: existing
ASCII files do not change their encoding, because the code points
up to 0x007F stay unchanged. Here is &quot;Der L&ouml;wen&quot; again:</p>
<div class="literallayout">
<p><tt class="literal"> D  e  r     L  &ouml;     w  e  n<br>
&quot;44 65 72 20 4c c3 b6 77 65 6e&quot;</tt></p>
</div>
<p>Note the &quot;<tt class="literal">c3 b6</tt>&quot; value that represents
the o-umlaut character.</p>
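<p>To see where "c3 b6" comes from, here is a hand-written UTF-8
encoder for a single code point. It is a sketch for illustration (in
practice you would use the library's own codec), and the name
<tt class="literal">utf8_encode</tt> is ours.</p>

```python
def utf8_encode(cp):
    """Encode one Unicode code point as UTF-8 bytes."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if cp <= 0x7F:                       # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:                      # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:                     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),     # 4 bytes: 11110xxx and three 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

encoded = b"".join(utf8_encode(ord(c)) for c in "Der L\u00f6wen")
print(encoded.hex())   # prints 446572204cc3b677656e
```

<p>For the o-umlaut, 0xF6 is 11110110 in binary: the top two bits go
into the first byte (0xC0 | 0x03 = 0xC3) and the lower six bits into
the second (0x80 | 0x36 = 0xB6).</p>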
<p>UTF-8 encoding will keep the size of current ASCII files the
same; files that contain some extended ASCII values will increase
in size in proportion to the number of such characters.</p>
<p>There is also a character encoding called UTF-32, which as you
can guess is a 4 byte (32 bit) representation of the character
codes. It is not, as far as I know, in general use and we will
ignore it for the rest of the article. For characters in the BMP it
is the same as UTF-16 with two extra bytes set to 0x00, and it does
not need surrogate pairs at all.</p>
<p>So which encoding is the best to use? It depends on what
characters the source file contains. Files with a lot of ASCII will
be better off in UTF-8. If they contain a lot of extended ASCII
they may double in size and if they contain a lot of non-Latin
extended characters, the file could end up three or four times
larger. A UTF-16 encoded file will always be double the size unless
it consists mainly of surrogate pairs (an unlikely occurrence at
present) in which case it will be up to four times the size.</p>
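<p>The trade-off is easy to measure. The sketch below counts the bytes
each encoding needs for a few sample strings; the samples and the
helper name <tt class="literal">encoded_sizes</tt> are our own
choices, not figures from any survey.</p>

```python
def encoded_sizes(text):
    """Return the number of bytes needed for text in each encoding (no BOM)."""
    return {name: len(text.encode(name))
            for name in ("utf-8", "utf-16-le", "utf-32-le")}

samples = {
    "ascii": "hello",
    "latin": "L\u00f6wen",             # one character above 0x7F
    "cjk": "\u6f22\u5b57",             # two Han ideographs from the BMP
    "supplementary": "\U00010321",     # Old Italic numeral, outside the BMP
}

for label, text in samples.items():
    print(label, encoded_sizes(text))
```

<p>Plain ASCII is smallest in UTF-8, BMP ideographs are smallest in
UTF-16, and a supplementary character takes four bytes in all three.</p>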
<p>There are, of course, many other character encodings in general
use, the most common on Windows platforms being Windows 1252. Many
programs assume that 1252 is the same as ISO-8859-1 but it is not:
1252 assigns printable characters (such as the euro sign and
typographic quotes) to byte values in the range 0x80 to 0x9F, which
ISO-8859-1 reserves for control codes. If there is a possibility that
data will be used on
other platforms, make sure that the program is really saving in
ISO-8859-1 format and not Windows 1252. Macintosh users will
probably be familiar with MacRoman encoding. Many Eastern languages
use Shift-JIS or Big5 encodings. However, most XML parsers will not
understand these encodings.</p>
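<p>The difference between the two encodings shows up as soon as a
byte in the range 0x80 to 0x9F is decoded. A short Python sketch,
using the standard "cp1252" and "iso-8859-1" codecs; the three sample
bytes and the helper <tt class="literal">can_encode</tt> are our own.</p>

```python
raw = bytes([0x80, 0x93, 0x94])

as_1252 = raw.decode("cp1252")          # euro sign and curly double quotes
as_latin1 = raw.decode("iso-8859-1")    # three C1 control codes

def can_encode(text, encoding):
    """Report whether every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

print([hex(ord(c)) for c in as_1252])
print([hex(ord(c)) for c in as_latin1])
print(can_encode("\u20ac", "cp1252"), can_encode("\u20ac", "iso-8859-1"))
```

<p>A file that really is ISO-8859-1 cannot contain a euro sign at all,
so a document that shows one was almost certainly saved as Windows
1252.</p>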
</div>
<div class="sect1" lang="en">
<div class="titlepage">
<h2><a name="d0e212" id="d0e212"></a>XML and
Unicode</h2>
</div>
<p>A parser complying with the XML specification must, at the
least, understand UTF-8 and UTF-16 encodings. Most parsers will
understand other encodings as well. Expat will understand UTF-8,
UTF-16, ISO-8859-1 and US-ASCII 'out of the box' and can be
extended to other formats. Xerces-C++ will understand the above
encodings and adds UCS-4 (32 bit values), EBCDIC (code pages IBM037
and IBM1140) and Windows-1252. The IBM parser, XML4C
(based on Xerces), understands a further 15 encodings.</p>
<p>Most programmers are familiar with the BigEndian/LittleEndian
differences in microprocessors. The same differences exist in
Unicode encodings, specifically with UTF-16, which can be BigEndian
or LittleEndian in the same manner as microprocessors
can. The character &quot;e&quot; (0x65) would be represented as 0x65 0x00 in
LittleEndian format and 0x00 0x65 in BigEndian format. To inform
the parser which format the file is in, the file starts with a
<i class="firstterm">byte order mark</i> (BOM).</p>
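<p>As a small illustration, the following Python sketch writes the
character "e" in both byte orders with a byte order mark in front.
The BOM constants come from the standard
<tt class="literal">codecs</tt> module; the helper
<tt class="literal">with_bom</tt> is a name of our own.</p>

```python
import codecs

def with_bom(text, big_endian):
    """Encode text as UTF-16 in the requested byte order, preceded by a BOM."""
    if big_endian:
        return codecs.BOM_UTF16_BE + text.encode("utf-16-be")
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")

little = with_bom("e", big_endian=False)
big = with_bom("e", big_endian=True)
print(little.hex(), big.hex())   # prints fffe6500 feff0065

# The generic "utf-16" decoder reads the BOM to pick the byte order.
print(little.decode("utf-16") == big.decode("utf-16"))
```

<p>Both byte strings decode to the same single character, because the
decoder uses the mark to choose the byte order and then discards
it.</p>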
</div>
<div class="sect1" lang="en">
<div class="titlepage">
<h2><a name="d0e222" id="d0e222"></a>The Byte
Order Mark and Parsing</h2>
</div>
<p>An XML file normally starts with an XML declaration:</p>
<p><tt class="literal">&lt;?xml version=&quot;1.0&quot;
encoding=&quot;something&quot;?&gt;</tt></p>
<p>If the encoding is missing, and there is no byte order mark saying
otherwise, then UTF-8 is assumed.</p>
<p>The question naturally arises, how does the parser start reading
the file to reach the encoding part of the header? If the above
section of code is in UTF-8, then the encoding part starts at
position 20, if it is in UTF-16, it will be at position 40. If the
file is BigEndian it needs to be read differently than if it is
LittleEndian. It's a chicken and egg problem, so the parser starts a
file with a little testing of its own, like so:</p>
<div class="itemizedlist">
<ul type="disc">
<li>
<p>If the first two bytes are '3C 3F' (&quot;&lt;?&quot;) then standard UTF-8
encoding is assumed</p>
</li>
<li>
<p>If the first three bytes are 'EF BB BF' (BOM/UTF-8) then
standard UTF-8 encoding is assumed</p>
</li>
<li>
<p>If the first two bytes are 'FF FE' (BOM/Little) then UTF-16
LittleEndian is assumed</p>
</li>
<li>
<p>If the first two bytes are 'FE FF' (BOM/Big) then UTF-16
BigEndian is assumed.</p>
</li>
<li>
<p>If the first four bytes are 'FF FE 00 00' (BOM/Little) then
UTF-32 LittleEndian is assumed (this has to be tested before the two
byte 'FF FE' check)</p>
</li>
<li>
<p>If the first four bytes are '00 00 FE FF' (BOM/Big) then UTF-32
BigEndian is assumed</p>
</li>
<li>
<p>Perform one or two other checks (EBCDIC encoding for
example).</p>
</li>
</ul>
</div>
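<p>The checks in the list above can be sketched as a single function.
This is an illustration only and is not taken from any particular
parser: the function name and the returned labels are our own, and
for UTF-32 we assume the marks are the byte-swapped forms of 0xFEFF,
that is 00 00 FE FF for BigEndian and FF FE 00 00 for
LittleEndian.</p>

```python
def sniff_encoding(head):
    """Guess the encoding family of an XML file from its first few bytes.

    Returns a label, or None when further checks (EBCDIC, for example)
    would be needed.
    """
    # Four byte marks first: FF FE 00 00 also begins with the UTF-16 LE mark.
    # (A UTF-16 file cannot start FF FE 00 00, since XML forbids the NUL character.)
    if head.startswith(b"\x00\x00\xfe\xff"):
        return "UTF-32BE"
    if head.startswith(b"\xff\xfe\x00\x00"):
        return "UTF-32LE"
    if head.startswith(b"\xef\xbb\xbf"):
        return "UTF-8"
    if head.startswith(b"\xff\xfe"):
        return "UTF-16LE"
    if head.startswith(b"\xfe\xff"):
        return "UTF-16BE"
    if head.startswith(b"<?"):
        # Provisional: good enough to read as far as the encoding declaration.
        return "UTF-8"
    return None

print(sniff_encoding(b'<?xml version="1.0" encoding="GBK"?>'))
```

<p>The result is only a starting point. Once the parser has read as
far as the encoding declaration it switches to whatever encoding is
named there.</p>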
<p><tt class="literal">&quot;FF FE&quot;</tt> or <tt class="literal">&quot;FE
FF&quot;</tt> are called the <i class="firstterm">byte order mark</i>
and indicate that the file is in UTF-16 format and whether it is a
LittleEndian file or a BigEndian file. In the Unicode character
set, 0xFEFF represents a 'zero width no-break space', so it will
not affect the printing of the file, and 0xFFFE is guaranteed not to
be a character.</p>
<p>The actual checks performed will depend on the parser
implementation, but it will be something along the lines above.
Assuming all goes well, the '<tt class=
"literal">encoding=&quot;GBK&quot;</tt>' (for example) part will be reached
and the actual encoding established.</p>
<p>At this point the parser will have to check whether it can
support the encoding and either continue or report an error.
Throughout the parsing process the parser will be reading
characters and checking for particular codes or combinations. In
UTF-16 the process is reasonably straightforward: every 2 byte
value is a character and can be dealt with as such, with the
exception of values between 0xD800 and 0xDFFF. A value from 0xD800
to 0xDBFF indicates the start of a surrogate pair and must be
followed by a value from 0xDC00 to 0xDFFF; if the two values do not
form a valid pair, the parser will indicate an error. Any further
action the parser takes is dependent on the parser used, e.g.
converting the text into another format.</p>
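<p>The UTF-16 reading loop described above might look something like
this. It is a sketch under our own assumptions: the input is a list of
16 bit values that have already been put into the right byte order,
and the function name is invented.</p>

```python
def utf16_units_to_code_points(units):
    """Turn a sequence of 16 bit values into code points, checking surrogates."""
    result = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # First half of a pair: the next value must be in 0xDC00-0xDFFF.
            if i + 1 >= len(units) or not 0xDC00 <= units[i + 1] <= 0xDFFF:
                raise ValueError(f"unpaired surrogate at index {i}")
            result.append((u - 0xD800) * 0x0400 + (units[i + 1] - 0xDC00) + 0x10000)
            i += 2
        elif 0xDC00 <= u <= 0xDFFF:
            # Second half of a pair with no first half in front of it.
            raise ValueError(f"unexpected surrogate at index {i}")
        else:
            result.append(u)   # an ordinary BMP character
            i += 1
    return result

print([hex(cp) for cp in utf16_units_to_code_points([0x44, 0xD800, 0xDF21])])
```

<p>A real parser would report the position in the file rather than
raise a bare exception, but the test on the two ranges is the
same.</p>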
<p>In UTF-8, the situation is a little more complicated as the
width of the characters varies. Table 2 will help you understand how
the parser deals with the 8 bit values it reads.</p>
<div class="c2"><img src="/var/uploads/journals/resources/table2.png" align=
"middle"></div>
<p>If a byte is below 0x80, it is a character. If it is between
0xC2 and 0xDF, then fetch another byte, which must be between 0x80
and 0xBF. And so on.</p>
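<p>The "and so on" can be spelled out in code. The sketch below
decodes one character from a byte string. It uses the ranges given in
the paragraph above for one and two byte sequences; the lead byte
ranges for three and four byte sequences (0xE0 to 0xEF and 0xF0 to
0xF4) are the standard UTF-8 ones and are our assumption about the
rest of the table. The function names are our own.</p>

```python
def sequence_length(first):
    """Return how many bytes the character starting with this byte occupies."""
    if first <= 0x7F:
        return 1
    if 0xC2 <= first <= 0xDF:
        return 2
    if 0xE0 <= first <= 0xEF:
        return 3
    if 0xF0 <= first <= 0xF4:
        return 4
    raise ValueError(f"0x{first:02X} cannot be a first byte")

MINIMUM = {1: 0x00, 2: 0x80, 3: 0x800, 4: 0x10000}

def decode_one(data, pos=0):
    """Decode the character at data[pos]; return (code point, bytes used)."""
    first = data[pos]
    length = sequence_length(first)
    if pos + length > len(data):
        raise ValueError("truncated sequence")
    # Keep only the payload bits of the first byte (7, 5, 4 or 3 of them).
    cp = first if length == 1 else first & (0xFF >> (length + 1))
    for b in data[pos + 1:pos + length]:
        if not 0x80 <= b <= 0xBF:
            raise ValueError(f"0x{b:02X} is not a continuation byte")
        cp = (cp << 6) | (b & 0x3F)
    # Reject over-long forms, surrogates and values beyond 0x10FFFF.
    if cp < MINIMUM[length] or 0xD800 <= cp <= 0xDFFF or cp > 0x10FFFF:
        raise ValueError("illegal UTF-8 sequence")
    return cp, length

print(decode_one(b"\xc3\xb6"))   # prints (246, 2)
```

<p>Calling <tt class="literal">decode_one</tt> repeatedly, advancing
by the returned length each time, walks through a whole buffer.</p>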
<p>From this it is clear that there are quite a number of illegal
sequences, for instance 0x80 to 0xC1 cannot be a first byte. It has
been cleverly arranged that a reader can 'drop in' on a byte stream
and know which part of a character sequence it is looking at.
Whatever the format of the file being parsed, internally the parser
will be using a single encoding of its own (UTF-8 by default in
Expat, UTF-16 in Xerces), so the programmer will need to take care of
converting it into something useful for the application like
displaying in the GUI or converting to a text file.</p>
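<p>The 'drop in' property mentioned above is worth a tiny example.
Because continuation bytes always lie between 0x80 and 0xBF and first
bytes never do, a reader landing at an arbitrary offset can step back
to the start of the current character. A minimal sketch, with a
function name of our own:</p>

```python
def start_of_character(data, pos):
    """Step back from pos to the first byte of the character containing it."""
    while pos > 0 and 0x80 <= data[pos] <= 0xBF:
        pos -= 1   # still inside a multi-byte sequence
    return pos

data = "L\u00f6we".encode("utf-8")      # bytes 4c c3 b6 77 65
print(start_of_character(data, 2))      # prints 1: b6 belongs to the character at c3
```

<p>No such trick is possible with encodings like Shift-JIS, where the
second byte of a character can look like the first byte of
another.</p>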
</div>
<div class="sect1" lang="en">
<div class="titlepage">
<h2><a name="d0e280" id="d0e280"></a>Unicode and
C++</h2>
</div>
<p>We'll wrap up this introduction with a short mention of C and
C++ (this being the ACCU journal). It seems like a natural match to
use <tt class="type">wchar_t</tt> to store Unicode characters
in a program, but it's not, because <tt class=
"type">wchar_t</tt> can be different sizes on different
architectures. In Windows NT it is 16 bits, under Linux it is 32
bits. It could even be 8 bits on some architectures.</p>
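<p>A quick way to see what the local C compiler's
<tt class="type">wchar_t</tt> looks like, without writing any C, is to
ask Python's standard <tt class="literal">ctypes</tt> module, whose
<tt class="literal">c_wchar</tt> type mirrors the platform's
<tt class="type">wchar_t</tt>. This is a sketch only and the variable
name is our own.</p>

```python
import ctypes

# Size of the platform's wchar_t in bits; the value depends on where this runs.
wchar_bits = ctypes.sizeof(ctypes.c_wchar) * 8
print(f"wchar_t is {wchar_bits} bits on this platform")
```

<p>Running the same two lines on Windows and on Linux gives different
answers, which is exactly the portability problem described
above.</p>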
<p>How to program with Unicode in a portable manner is a complex
subject that we will revisit in a further article. For now, it's
enough to say that for portability it's best to specify either
<tt class="type">unsigned short</tt> for 16 bit Unicode, or
<tt class="type">long</tt> for 32 bit Unicode. The Xerces parser
has an XMLCh type (a <tt class="literal">typedef</tt> for <tt class=
"type">unsigned short</tt>) that is defined for the compiler being
used; Expat uses XML_Char (defined as a <tt class=
"type">char</tt>).</p>
</div>
<div class="sect1" lang="en">
<div class="titlepage">
<h2><a name="d0e308" id="d0e308"></a>Summary</h2>
</div>
<p>I hope this article has given a sufficient background to Unicode
and its use in XML. We'll continue the series by getting back
to simpler stuff like C++ programming, using an XML parser to read
in files. But it's important to understand the various encodings
that need to be dealt with.</p>
</div>
<div class="sect1" lang="en">
<div class="titlepage">
<h2><a name="d0e313" id="d0e313"></a>Further
Reading:</h2>
</div>
<p>XML Internationalisation and Localisation by Yves Savourel (SAMS
publishing): an excellent guide to the various issues with using
Unicode. Although aimed at XML users, it also has pertinent
information for anyone translating programs.</p>
<p>For all the information you could ever want about
internationalization, character sets, encodings and glyphs, pay a
visit to the Unicode Consortium website. <a href=
"http://www.unicode.org" target=
"_top">http://www.unicode.org</a></p>
<p><span class="bold"><b>Latest version of the Unicode
standard:</b></span></p>
<p><a href=
"http://www.unicode.org/unicode/standard/versions/enumeratedversions.html"
target=
"_top">http://www.unicode.org/unicode/standard/versions/enumeratedversions.html</a></p>
<p><span class="bold"><b>The I18N Gurus page:</b></span></p>
<p><a href="http://www.i18ngurus.com/" target=
"_top">http://www.i18ngurus.com/</a></p>
<p>Open directory of links to internationalization (i18n) resources
and related material.</p>
<p><span class="bold"><b>The OpenTag website</b></span></p>
<p><a href="http://www.opentag.com/" target=
"_top">http://www.opentag.com/</a></p>
<p><span class="bold"><b>The very excellent piece by Tom
Jennings:</b></span></p>
<p><a href="http://www.wps.com/texts/codes/index.html" target=
"_top">http://www.wps.com/texts/codes/index.html</a></p>
<p>Annotated history of character codes, which I borrowed heavily
from in the Introduction.</p>
<p><span class="bold"><b>International Character Codes overview
(from 1995):</b></span></p>
<p><a href="http://consult.cern.ch/cnl/215/node45.html" target=
"_top">http://consult.cern.ch/cnl/215/node45.html</a></p>
<p><span class="bold"><b>The following RFCs are of interest in
working with Unicode:</b></span></p>
<p>RFC 2781 - UTF-16, an encoding of ISO 10646</p>
<p>RFC 2279 - UTF-8, a transformation format of ISO 10646</p>
<p>RFC 2152 - UTF-7 A Mail-Safe Transformation Format of
Unicode</p>
</div>
<div class="footnotes"><br>
<hr class="c3" width="100">
<div class="footnote">
<p><sup>[<a name="ftn.d0e33" href="#d0e33" id=
"ftn.d0e33">1</a>]</sup> Strictly speaking we could consider Morse
code a ternary system, consisting of dots, dashes or blank. Or even
base 4 if we consider short blanks and long blanks.</p>
</div>
<div class="footnote">
<p><sup>[<a name="ftn.d0e69" href="#d0e69" id=
"ftn.d0e69">2</a>]</sup> Well, we could argue that the graphics
drawing characters have no semantic meaning.</p>
</div>
</div>
</p>
</div>
</channel>
</rss>
