Internationalisation (also known as i18n, because there are 18 characters between the 'i' and the 'n' in 'internationalisation', and because Americans and the British can't agree on whether to spell it with an 's' or a 'z') covers a number of areas:
How characters and strings of characters are stored in files and in program memory. This usually involves the use of ‘character sets’ and ‘code pages’ to describe a mapping between actual language characters and some binary storage values.
How to separate messages (informational, error, etc.) from code, and how to format various values (dates, times, currencies, etc.) in a way appropriate for the target language and country. One apparently simple but often overlooked example is how to 'pluralise' a message in a language-independent way, e.g. "Converted 1 row" versus "Converted 123 rows".
How to choose the correct font, the direction in which to display text (left to right, right to left, top to bottom, etc.), input methods for more complex character sets, and so on.
This document will deal mostly with the first of the issues above.
The most basic problem with computers is how to store 'text' within a computer. Since computers deal almost exclusively with binary numbers, we need to find some way of mapping what we call 'letters' or 'characters' to these binary numbers. This mapping of 'characters' to 'numbers' is done with the concept of a 'character set' or 'code page'. In general, these are all lookup tables, which map each character to a single number.
For the purposes of this document, we'll take classic 'ASCII' as the starting point of character representation. There were many methods before it, but ASCII is by far the most widespread standard, and the one on which most other schemes are based. ASCII is also known as US-ASCII or ISO 646. ASCII maps the digits, upper-case letters and lower-case letters of the Latin alphabet (which also happens to be the English alphabet), plus numerous symbols and a set of 'control codes', onto 128 different values. Each of these 128 values can be represented as a 7-bit binary, 3-digit octal or 2-digit hex number.
Char   Dec  Oct  Hex  |  Char  Dec  Oct  Hex  |  Char  Dec  Oct  Hex  |  Char   Dec  Oct  Hex
(nul)    0    0  0x00 |  (sp)   32   40  0x20 |  @      64  100  0x40 |  `       96  140  0x60
(soh)    1    1  0x01 |  !      33   41  0x21 |  A      65  101  0x41 |  a       97  141  0x61
(stx)    2    2  0x02 |  "      34   42  0x22 |  B      66  102  0x42 |  b       98  142  0x62
(etx)    3    3  0x03 |  #      35   43  0x23 |  C      67  103  0x43 |  c       99  143  0x63
(eot)    4    4  0x04 |  $      36   44  0x24 |  D      68  104  0x44 |  d      100  144  0x64
(enq)    5    5  0x05 |  %      37   45  0x25 |  E      69  105  0x45 |  e      101  145  0x65
(ack)    6    6  0x06 |  &      38   46  0x26 |  F      70  106  0x46 |  f      102  146  0x66
(bel)    7    7  0x07 |  '      39   47  0x27 |  G      71  107  0x47 |  g      103  147  0x67
(bs)     8   10  0x08 |  (      40   50  0x28 |  H      72  110  0x48 |  h      104  150  0x68
(ht)     9   11  0x09 |  )      41   51  0x29 |  I      73  111  0x49 |  i      105  151  0x69
(nl)    10   12  0x0a |  *      42   52  0x2a |  J      74  112  0x4a |  j      106  152  0x6a
(vt)    11   13  0x0b |  +      43   53  0x2b |  K      75  113  0x4b |  k      107  153  0x6b
(np)    12   14  0x0c |  ,      44   54  0x2c |  L      76  114  0x4c |  l      108  154  0x6c
(cr)    13   15  0x0d |  -      45   55  0x2d |  M      77  115  0x4d |  m      109  155  0x6d
(so)    14   16  0x0e |  .      46   56  0x2e |  N      78  116  0x4e |  n      110  156  0x6e
(si)    15   17  0x0f |  /      47   57  0x2f |  O      79  117  0x4f |  o      111  157  0x6f
(dle)   16   20  0x10 |  0      48   60  0x30 |  P      80  120  0x50 |  p      112  160  0x70
(dc1)   17   21  0x11 |  1      49   61  0x31 |  Q      81  121  0x51 |  q      113  161  0x71
(dc2)   18   22  0x12 |  2      50   62  0x32 |  R      82  122  0x52 |  r      114  162  0x72
(dc3)   19   23  0x13 |  3      51   63  0x33 |  S      83  123  0x53 |  s      115  163  0x73
(dc4)   20   24  0x14 |  4      52   64  0x34 |  T      84  124  0x54 |  t      116  164  0x74
(nak)   21   25  0x15 |  5      53   65  0x35 |  U      85  125  0x55 |  u      117  165  0x75
(syn)   22   26  0x16 |  6      54   66  0x36 |  V      86  126  0x56 |  v      118  166  0x76
(etb)   23   27  0x17 |  7      55   67  0x37 |  W      87  127  0x57 |  w      119  167  0x77
(can)   24   30  0x18 |  8      56   70  0x38 |  X      88  130  0x58 |  x      120  170  0x78
(em)    25   31  0x19 |  9      57   71  0x39 |  Y      89  131  0x59 |  y      121  171  0x79
(sub)   26   32  0x1a |  :      58   72  0x3a |  Z      90  132  0x5a |  z      122  172  0x7a
(esc)   27   33  0x1b |  ;      59   73  0x3b |  [      91  133  0x5b |  {      123  173  0x7b
(fs)    28   34  0x1c |  <      60   74  0x3c |  \      92  134  0x5c |  |      124  174  0x7c
(gs)    29   35  0x1d |  =      61   75  0x3d |  ]      93  135  0x5d |  }      125  175  0x7d
(rs)    30   36  0x1e |  >      62   76  0x3e |  ^      94  136  0x5e |  ~      126  176  0x7e
(us)    31   37  0x1f |  ?      63   77  0x3f |  _      95  137  0x5f |  (del)  127  177  0x7f
This is purely the Latin/English alphabet. There are no basic European characters (vowels with accents, graves, umlauts, etc.). Also, the lowest 32 numbers are designated as special 'control characters'. ASCII was designed in the days of tele-types and output printer terminals, and these codes were used to control the way the printer worked (e.g. (cr) = 0x0d = 'carriage return' = 'return the print carriage to the left side of the page'). Only a few of these are used today, namely 'carriage return (cr)', 'line feed (lf)' and 'horizontal tab (ht)'. These are generally used only in files to tell software that reads the files where lines of data actually break, rather than to affect the physical output of a printer.
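The mapping is easy to see from a program. Here is a minimal C sketch that prints the number the system's character set assigns to each character of a string; the output shown assumes an ASCII-compatible system:

    #include <stdio.h>

    int main(void)
    {
        /* In C a char is just a small integer; printing it as a number
           reveals the value the character set assigns to that character. */
        const char *word = "Hi!";
        for (const char *p = word; *p != '\0'; p++)
            printf("'%c' -> %d (0x%02x)\n", *p, (unsigned char)*p, (unsigned char)*p);
        return 0;   /* prints: 'H' -> 72 (0x48), 'i' -> 105 (0x69), '!' -> 33 (0x21) */
    }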
ASCII dates from a time when quite a few systems had 7-bit bytes; such systems are very rare today, and almost all machines deal exclusively in 8-bit bytes. Because of this, almost all extensions to ASCII involve using the high 8th bit for various 'tricks' to extend the range of possible characters.
In general we can break these extensions down into three types:
1. Single-byte character sets (SBCS), in which every character is represented by a single 8-bit byte.
2. Multi-byte character sets (MBCS), in which a single byte might act as an 'escape' value, so several further bytes might be required to determine the entire character being represented.
3. Mode-switching ('moded') character sets, in which a special 'escape sequence' is used to switch between modes, and the interpretation of byte values is completely changed by what mode you are in.
By their nature, SBCS systems are easier for most programs to deal with, because:
1. The number of characters in a string is simply the number of bytes.
2. Finding the nth character is a simple offset calculation.
3. Moving forwards or backwards through a string means moving one byte at a time.
For MBCS systems, each of these is considerably more complex, as the sketch below illustrates.
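A small C sketch of the first point, using UTF-8 (a multi-byte encoding discussed later) for the non-ASCII character:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* "é" encoded in UTF-8 is the two bytes 0xC3 0xA9, so the
           4-character word "café" occupies 5 bytes. */
        const char cafe[] = "caf\xC3\xA9";

        /* A byte-oriented routine such as strlen counts bytes, not characters;
           an MBCS-aware routine is needed to report 4. */
        printf("strlen = %zu bytes\n", strlen(cafe));   /* prints 5 */
        return 0;
    }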
Moded systems are particularly difficult to deal with, because you have to parse from the start of the data: you can't tell how to interpret the bytes you see without knowing what the last escape sequence was. The most common moded character set is iso-2022-jp for Japanese. It is also particularly interesting in that it uses only 7-bit values; the escape sequences tell you whether to interpret the data that follows as ASCII or as Japanese.
For almost all other character encoding systems, all characters in the 0-127 range are their equivalent ASCII counterparts. This means that in most cases, a plain ASCII string is also a valid string in just about every character set.
A common family of character sets in wide use is the ISO-8859-X series. These are all single-byte character sets, and each one is designed for a particular language or group of languages (note that the source this list was taken from is fairly old and out of date):
8859-1     Europe, Latin America
8859-2     Eastern Europe
8859-3     SE Europe/miscellaneous (Esperanto, Maltese, etc.)
8859-4     Scandinavia/Baltic (mostly covered by 8859-1 also)
8859-5     Cyrillic
8859-6     Arabic
8859-7     Greek
8859-8     Hebrew
8859-9     Latin5, same as 8859-1 except for Turkish instead of Icelandic
8859-10    Latin6, for Lappish/Nordic/Eskimo languages
Because these are all single-byte character sets, there is no ISO-8859 character set for languages with very large character repertoires (e.g. the Asian languages – Japanese, Korean, etc.).
Because of the nature of Japanese, Korean, Chinese, Vietnamese, etc. (collectively often called CJKV), these languages need a large range of values in which to represent all their characters. In general 1 byte (256 values) is not nearly enough, so the encoding schemes for these languages are all MBCS. The best-known examples are 'Big5', an encoding scheme for Chinese characters, and Shift-JIS for Japanese characters.
In general, each encoding is designed with a particular language in mind. One problem with the ISO-8859-X series is that you can’t tell which one is being used without being explicitly told. Given a file, does it have 8859-6 Arabic characters or 8859-7 Greek characters? Also what if you want Arabic and Greek characters in the same string or file, let alone 8859-6 Arabic characters and Big5 Chinese characters?
The attempted solution to these issues is Unicode. The following extract from the Unicode web site gives a good summary of the aim and of the situation it addresses:
Fundamentally, computers just deal with numbers. They store letters and other characters by assigning a number for each one. Before Unicode was invented, there were hundreds of different encoding systems for assigning these numbers. No single encoding could contain enough characters: for example, the European Union alone requires several different encodings to cover all its languages. Even for a single language like English no single encoding was adequate for all the letters, punctuation, and technical symbols in common use.
These encoding systems also conflict with one another. That is, two encodings can use the same number for two different characters, or use different numbers for the same character. Any given computer (especially servers) needs to support many different encodings; yet whenever data is passed between different encodings or platforms, that data always runs the risk of corruption.
Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.
Thus Unicode can be considered the ultimate character set, encompassing every character in every language. Unicode has a huge amount of support (from companies and in software infrastructure) and basically can be considered the standard way forward to deal with character set encoding issues.
Unicode was designed to be a clean sweep, supporting a nice clean mapping from the start. As is usually the way, things didn't quite go as planned. It was initially thought that 65,536 characters would be enough to encompass all characters. Thus, initial Unicode implementations treated each character as an unsigned 16-bit integer, allowing a return to the '1 byte = 1 character' rule, now in the form '1 word = 1 character' for any language. This would again make it easy to find the number of characters in a string, move around within strings, and so on.
Unfortunately, after a while it was realised that 65,536 characters would not be enough, and Unicode ended up with several different encoding forms.
UTF-16 consists of a series of 16-bit unsigned integers (code units). The following ranges are defined:
- 0x0000 to 0xD7FF: the code point itself
- 0xD800 to 0xDBFF: high (leading) surrogates
- 0xDC00 to 0xDFFF: low (trailing) surrogates
- 0xE000 to 0xFFFF: the code point itself
The surrogates are the interesting part: they allow the representation of code points greater than 0xFFFF using two 16-bit values. A value in the 0xD800 to 0xDBFF range must ALWAYS be followed by a value in the 0xDC00 to 0xDFFF range. This makes it easy to scan a string for a high or low surrogate and either look 1 word forward or 1 word backward to find the corresponding surrogate.
Looking at the surrogate values we see that the range 0xD800 -> 0xDBFF has 1024 values, and ditto for 0xDC00 -> 0xDFFF. Thus the surrogate pairs can represent 1024 * 1024 possible combinations, which covers the 1,048,576 supplementary Unicode code points (0x10000 to 0x10FFFF).
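A minimal C sketch of how that packing works (the function name is just illustrative):

    #include <stdio.h>
    #include <stdint.h>

    /* Encode a Unicode code point above 0xFFFF as a UTF-16 surrogate pair.
       cp must be in the range 0x10000..0x10FFFF. */
    static void to_surrogate_pair(uint32_t cp, uint16_t *high, uint16_t *low)
    {
        cp -= 0x10000;                            /* 20 significant bits remain    */
        *high = 0xD800 + (uint16_t)(cp >> 10);    /* top 10 bits -> high surrogate */
        *low  = 0xDC00 + (uint16_t)(cp & 0x3FF);  /* low 10 bits -> low surrogate  */
    }

    int main(void)
    {
        uint16_t hi, lo;
        to_surrogate_pair(0x1D11E, &hi, &lo);          /* U+1D11E, musical G clef */
        printf("U+1D11E -> 0x%04X 0x%04X\n", hi, lo);  /* prints 0xD834 0xDD1E    */
        return 0;
    }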
UTF-8 is the most complex of the Unicode encodings, but it is also commonly used in the Unix world and with XML and HTML. The reason for this is that many existing programs that deal with strings only in an opaque way (e.g. just move around this 'NULL-terminated array of 8-bit chars') will work fine with UTF-8. There's no need to change to using word pointers or the like.
As an added advantage, UTF-8 is backwards compatible with ASCII as well. That is, all byte values < 0x80 represent ASCII characters, so you can transition your code internally from ASCII to UTF-8 without having to make the change in one big hit.
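The encoding itself packs each code point into 1 to 4 bytes, with the lead byte indicating how many trailing bytes follow. A minimal C sketch of an encoder (no validation of surrogate values or other ill-formed input):

    #include <stdio.h>
    #include <stdint.h>

    /* Encode one Unicode code point (0..0x10FFFF) as UTF-8.
       Returns the number of bytes written to out (1..4). */
    static int utf8_encode(uint32_t cp, unsigned char out[4])
    {
        if (cp < 0x80) {                      /* 1 byte: plain ASCII         */
            out[0] = (unsigned char)cp;
            return 1;
        } else if (cp < 0x800) {              /* 2 bytes: 110xxxxx 10xxxxxx  */
            out[0] = 0xC0 | (cp >> 6);
            out[1] = 0x80 | (cp & 0x3F);
            return 2;
        } else if (cp < 0x10000) {            /* 3 bytes                     */
            out[0] = 0xE0 | (cp >> 12);
            out[1] = 0x80 | ((cp >> 6) & 0x3F);
            out[2] = 0x80 | (cp & 0x3F);
            return 3;
        } else {                              /* 4 bytes                     */
            out[0] = 0xF0 | (cp >> 18);
            out[1] = 0x80 | ((cp >> 12) & 0x3F);
            out[2] = 0x80 | ((cp >> 6) & 0x3F);
            out[3] = 0x80 | (cp & 0x3F);
            return 4;
        }
    }

    int main(void)
    {
        unsigned char buf[4];
        int n = utf8_encode(0x00E9, buf);     /* U+00E9, e with acute accent */
        for (int i = 0; i < n; i++)
            printf("0x%02X ", buf[i]);        /* prints 0xC3 0xA9 */
        printf("\n");
        return 0;
    }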
You'll probably also encounter the terms UCS-2 and UCS-4 with regard to Unicode encoding. UCS-2 is just UTF-16 without the surrogate pairs, and UCS-4 is, as far as I can tell, the same as UTF-32.
The following really good summary is copied directly from:
http://www-106.ibm.com/developerworks/library/utfencodingforms/
Programming using UTF-8 and UTF-16 is much more straightforward than with other mixed-width character encodings. For each code point, they have either a singleton form or a multi-unit form. With UTF-16, there is only one multi-unit form, having exactly two code units. With UTF-8, the number of trailing units is determined by the value of the lead unit: thus you can't have the same lead unit with a different number of trailing units. Within each encoding form, the values for singletons, for lead units, and for trailing units are all completely disjoint. This has crucial implications for implementations:
- No overlap.
If you search for string A in a string B, you will never get a false match on code points. You never need to convert to code points for string searching. False matches never occur because the end of one sequence can never be the same as the start of another sequence. Overlap is one of the biggest problems with common multi-byte encodings like Shift-JIS. All of the UTFs avoid this problem.
- Determinate boundaries.
If you randomly access into text, you can always determine the nearest code-point boundaries with a small number of machine instructions.
- Pass-through.
Processes that don't look at particular character values don't need to know about the internal structure of the text.
- Simple iteration.
Getting the next or previous code point is straightforward, and only takes a small number of machine instructions.
- Slow indexing.
Except in UTF-32, it is inefficient to find code unit boundaries corresponding to the nth code point, or to find the code point offset containing the nth code unit. Both involve scanning from the start of the text.
- Frequency.
Because the proportion of world text that needs surrogate space is extremely small, UTF-16 code should always be optimized for the single code unit. With UTF-8, it is probably worth optimizing for the single-unit case also, but not if it slows down the multi-unit case appreciably.
UTF-8 has one additional complication, called the shortest form requirement. Of the possible sequences in Table 1 that could represent a code point, the Unicode Standard requires that the shortest possible sequence be generated. When mapping back from code units to code points, however, implementations are not required to check for the shortest form. This problem does not occur in UTF-16.
Most systems will be upgrading their UTF-16 support for surrogates in the next year or two. This upgrade can use a phased approach. From a market standpoint, the only interesting surrogate-space characters expected in the near term are an additional set of CJK ideographs used for Japan, China, and Korea. If a system is already internationalized, most of the operations on the system will work sufficiently well that minor changes will suffice for these in the near term.
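The 'determinate boundaries' and 'simple iteration' points above are easy to see for UTF-8: every trailing byte has the bit pattern 10xxxxxx, so from any byte offset you can back up to a code-point boundary. A small C sketch (the function name is just illustrative):

    #include <stdio.h>

    /* Return the offset of the nearest code-point boundary at or before pos,
       by skipping backwards over UTF-8 trailing bytes (10xxxxxx). */
    static size_t utf8_boundary_before(const unsigned char *s, size_t pos)
    {
        while (pos > 0 && (s[pos] & 0xC0) == 0x80)
            pos--;
        return pos;
    }

    int main(void)
    {
        const unsigned char text[] = "caf\xC3\xA9!";   /* "café!" in UTF-8 */
        /* Offset 4 lands on the trailing byte 0xA9; the boundary is at 3. */
        printf("boundary: %zu\n", utf8_boundary_before(text, 4));   /* prints 3 */
        return 0;
    }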
Probably the two hardest problems with Unicode strings are normalisation and collation.
Roughly speaking, normalisation is required because Unicode sometimes allows multiple representations of a single string. For instance, there is a code point for 'á' (the letter 'a' with an acute accent), but there is also a code point for 'a' and a code point for a combining '´' accent character. Thus the single code point á and the two code points a + ´ together should be considered the same character for comparison purposes.
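A minimal C sketch of the problem, comparing the two UTF-8 spellings byte-for-byte:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* Two spellings of "á" in UTF-8:
           precomposed U+00E1         -> bytes 0xC3 0xA1
           decomposed  U+0061 U+0301  -> bytes 0x61 0xCC 0x81 */
        const char nfc[] = "\xC3\xA1";
        const char nfd[] = "a\xCC\x81";

        /* A naive byte comparison says the strings differ even though a user
           sees the same character; this is why normalisation is needed. */
        printf("bitwise equal? %s\n", strcmp(nfc, nfd) == 0 ? "yes" : "no");  /* no */
        return 0;
    }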
The following discussion of multiple representations and normalisation is taken from:
http://www-106.ibm.com/developerworks/library/internationalization-support.html
Multiple representations
Unicode includes the concept of a combining character, which is a character that (generally) modifies the character before it in some way rather than showing up as a character on its own. For example, the letter é can be represented using a regular letter e followed by a combining acute-accent character. This occupies two storage positions in memory. Unicode defines these characters to allow flexibility in its use.
If you need a certain type of accented character, Unicode can give you the base character and the accent rather than having to assign a whole new code-point value to the actual combination you want to display. This greatly expands the effective number of characters that Unicode can encode. In some cases, such as Korean, "characters" are broken up into smaller units that can be combined into the actual characters. This was done to save on code-point assignments.
In many cases, including the two examples above, Unicode actually does have a single code point representing the entire unit. The letter é can be represented using its own single code-point value, and all Korean syllables (including many that don't naturally occur in Korean) have single code-point values. This is because most constituencies prefer the precomposed versions. Storing é as two characters would both complicate processing and increase storage size. This introduces a significant pro-English bias into what's supposed to be an international standard. Similarly, requiring 6 bytes for each Korean syllable when only 2 bytes are required for each Japanese ideograph introduces an anti-Korean bias.
As a result, many characters in Unicode have multiple possible representations. In fact, many characters that don't have a specific code point in Unicode (for example, many letters with two diacritical marks on them) have multiple sequences of Unicode characters that can represent them. This can make two strings that appear to be the same to the user appear to be different to the computer.
The Unicode standard generally requires that implementations treat alternative "spellings" of the same sequence of characters as identical. Unfortunately, this can be impractical.
Most processes just want to do bitwise equality on two strings. Doing anything else imposes a huge overhead both in executable size and in performance. But because many of these processes can't control the source of the text, they're stuck. The traditional way of handling this is normalization -- picking a preferred representation for every character or sequence of characters that can be represented multiple ways in Unicode. The Unicode standard does this by declaring a preferred ordering for multiple accent marks on a single base character and by declaring a "canonical decomposition" into multiple characters for every single character that can also be represented as two or more characters.
Four normalization forms
The Unicode standard also defines a set of "compatibility decompositions" for characters that can be decomposed into other characters but only with the loss of some information. Newer versions of the standard also define "compatibility compositions" and "canonical compositions." This actually gives you a choice of four "normalized forms" for a string.
A program can make bitwise equality comparison work right by declaring that all strings must be in a particular normalized form. The program has the choice of requiring that all text fed to it be normalized in order to work right (delegating the work to an outside entity), or normalizing things itself. The World Wide Web Consortium ran into this very problem and solved it by requiring all applications that produce text on the Internet to produce it in normalized form. Software that merely transfers text from one place to another or displays it can choose to normalize it again on receipt, but doesn't have to -- if everybody's followed the rules, it's already normalized.
Collation has to do with the ordering of strings. In English and most Western languages we understand well that the alphabet has an order, such that B comes before C but after A. However, even within that there are more difficult questions:
1. Are ‘a’ and ‘A’ equivalent?
2. What about á and a?
3. How do we order symbols?
Typically, ordering has been a binary comparison of ASCII text, or of ASCII text first converted entirely to one case (lower or upper).
Obviously this becomes much harder for Unicode. How do you give a consistent ordering to every possible character in all languages? The Unicode solution is given in this document, the Unicode Collation Algorithm. http://www.unicode.org/unicode/reports/tr10/
The general process for ordering Unicode strings is a bit different from the usual ASCII ordering method. Basically with ASCII you can do a byte by byte comparison. Because of the complexity of Unicode, the algorithm is designed as follows:
Briefly stated, the Unicode Collation Algorithm takes an input Unicode string and a Collation Element Table, containing mapping data for characters. It produces a sort key, which is an array of unsigned 16-bit integers. Two or more sort keys so produced can then be binary-compared to give the correct comparison between the strings for which they were generated.
1. For each string to compare, the algorithm generates an array of 16-bit word values
2. For each string, you can then do a binary compare on the generated word arrays to determine the relative ordering of two strings
The algorithm also specifies that you can have different 'levels' of ordering. For Latin script, the first three correspond roughly to:
1. alphabetic ordering (eg. AbC == abc, ábc == Abc)
2. diacritic ordering (eg. ábc != abc)
3. case ordering (eg Abc != abc)
The collation algorithm is reasonably complex, and you’ll probably want to use an already written library to do this for you.
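As a rough illustration of the 'generate a sort key, then binary-compare' idea (not the Unicode Collation Algorithm itself), the C library exposes the same shape through strxfrm and strcoll. The results depend on which locales are installed on the system, and the locale name below is an assumption:

    #include <stdio.h>
    #include <string.h>
    #include <locale.h>

    int main(void)
    {
        /* strxfrm builds a locale-dependent sort key; comparing the keys with
           strcmp gives the same ordering as strcoll on the original strings. */
        if (setlocale(LC_COLLATE, "en_US.UTF-8") == NULL)
            fprintf(stderr, "locale not available, falling back to \"C\"\n");

        const char *a = "cote";
        const char *b = "c\xC3\xB4t\xC3\xA9";   /* "côté" in UTF-8 */

        char ka[128], kb[128];
        strxfrm(ka, a, sizeof ka);
        strxfrm(kb, b, sizeof kb);

        /* The two results may differ in magnitude but always agree in sign. */
        printf("strcoll = %d, key compare = %d\n", strcoll(a, b), strcmp(ka, kb));
        return 0;
    }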
One of the best libraries for dealing with Unicode is IBM's International Components for Unicode (ICU) (http://oss.software.ibm.com/developerworks/opensource/icu/project/). This library is very extensive and covers much more than what I've discussed in this document (e.g. most of the items in point 2 at the start of this document: calendars, dates, times, currencies, numbers, line/word/sentence breaks, etc.).
With regard to what I've discussed so far, it includes routines for transcoding between different character sets/code pages, as well as for sorting strings based on a collation order.
Windows NT has native Unicode support. Unfortunately this effectively means UCS-2 support; UTF-16 (surrogate pair) support is limited (see this link http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/unicode_192r.asp). The Windows API does not support UTF-8 strings directly; it supports only ASCII or UCS-2 strings. Thus to use UTF-8 strings with the Windows API, you first have to convert the strings to UTF-16. You can do this using the calls WideCharToMultiByte and MultiByteToWideChar, both of which accept the code page 'CP_UTF8' for converting to/from UTF-8.
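A minimal sketch of that round trip (error handling mostly omitted; the buffer sizes are arbitrary for the example):

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        const char *utf8 = "caf\xC3\xA9";   /* "café" in UTF-8 */

        /* Ask how many UTF-16 code units are needed (including the terminator),
           then do the real conversion. */
        int wlen = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, NULL, 0);
        if (wlen <= 0 || wlen > 64)
            return 1;

        WCHAR wbuf[64];
        MultiByteToWideChar(CP_UTF8, 0, utf8, -1, wbuf, wlen);
        /* wbuf now holds UTF-16 and can be passed to the wide ("W") Windows APIs. */

        /* And back again: UTF-16 to UTF-8. */
        char u8buf[64];
        WideCharToMultiByte(CP_UTF8, 0, wbuf, -1, u8buf, sizeof u8buf, NULL, NULL);

        printf("round trip: %s\n", u8buf);
        return 0;
    }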
Lots of the included links above are worth reviewing first. Here are some more:
Adding internationalization support to the base standard for JavaScript
Lessons learned in internationalizing the ECMAScript standard
http://www-106.ibm.com/developerworks/library/internationalization-support.html