Internationalisation (also known as i18n, because there are 18 characters between the 'i' and the 'n' in 'internationalisation', and because Americans and the British can't agree on whether to spell it with an 's' or a 'z') covers a number of areas:
How characters and strings of characters are stored in files and in program memory. This usually involves the use of ‘character sets’ and ‘code pages’ to describe a mapping between actual language characters and some binary storage values.
How to separate messages (informational, error, etc.) from code, and how to format various values (dates, times, currencies, etc.) in a way appropriate for the target language and country. One apparently simple but often overlooked example is how to 'pluralise' a message in a language-independent way, e.g. "Converted 1 row" versus "Converted 123 rows".
How to choose the correct font, the direction in which to display text (left to right, right to left, top to bottom, etc.), input methods for more complex character sets, and so on.
This document will deal mostly with the first of the issues above.
The most basic problem with computers is how to store 'text' within a computer. Since computers deal almost exclusively with binary numbers, we need to find some way of mapping what we call 'letters' or 'characters' to these binary numbers. This mapping of 'characters' to 'numbers' is done with the concept of a 'character set' or 'code page'. In general, these are all lookup tables, which map each character to a single number.
For the purposes of this document, we'll take classic 'ASCII' as the starting point of character representation. There were many methods before it, but ASCII is by far the most widespread standard, and the one on which most other schemes are based. ASCII is also known as US-ASCII or ISO 646. ASCII maps the digits, upper-case letters and lower-case letters of the Latin alphabet (which also happens to be the English alphabet), plus numerous symbols and a set of 'control codes', onto 128 different values. Each of these 128 values can be represented as a 7-bit binary, 3-digit octal or 2-digit hex number.
Char   Dec  Oct  Hex  |  Char  Dec  Oct  Hex  |  Char  Dec  Oct  Hex  |  Char   Dec  Oct  Hex
(nul)    0    0  0x00 |  (sp)   32   40  0x20 |  @      64  100  0x40 |  `       96  140  0x60
(soh)    1    1  0x01 |  !      33   41  0x21 |  A      65  101  0x41 |  a       97  141  0x61
(stx)    2    2  0x02 |  "      34   42  0x22 |  B      66  102  0x42 |  b       98  142  0x62
(etx)    3    3  0x03 |  #      35   43  0x23 |  C      67  103  0x43 |  c       99  143  0x63
(eot)    4    4  0x04 |  $      36   44  0x24 |  D      68  104  0x44 |  d      100  144  0x64
(enq)    5    5  0x05 |  %      37   45  0x25 |  E      69  105  0x45 |  e      101  145  0x65
(ack)    6    6  0x06 |  &      38   46  0x26 |  F      70  106  0x46 |  f      102  146  0x66
(bel)    7    7  0x07 |  '      39   47  0x27 |  G      71  107  0x47 |  g      103  147  0x67
(bs)     8   10  0x08 |  (      40   50  0x28 |  H      72  110  0x48 |  h      104  150  0x68
(ht)     9   11  0x09 |  )      41   51  0x29 |  I      73  111  0x49 |  i      105  151  0x69
(nl)    10   12  0x0a |  *      42   52  0x2a |  J      74  112  0x4a |  j      106  152  0x6a
(vt)    11   13  0x0b |  +      43   53  0x2b |  K      75  113  0x4b |  k      107  153  0x6b
(np)    12   14  0x0c |  ,      44   54  0x2c |  L      76  114  0x4c |  l      108  154  0x6c
(cr)    13   15  0x0d |  -      45   55  0x2d |  M      77  115  0x4d |  m      109  155  0x6d
(so)    14   16  0x0e |  .      46   56  0x2e |  N      78  116  0x4e |  n      110  156  0x6e
(si)    15   17  0x0f |  /      47   57  0x2f |  O      79  117  0x4f |  o      111  157  0x6f
(dle)   16   20  0x10 |  0      48   60  0x30 |  P      80  120  0x50 |  p      112  160  0x70
(dc1)   17   21  0x11 |  1      49   61  0x31 |  Q      81  121  0x51 |  q      113  161  0x71
(dc2)   18   22  0x12 |  2      50   62  0x32 |  R      82  122  0x52 |  r      114  162  0x72
(dc3)   19   23  0x13 |  3      51   63  0x33 |  S      83  123  0x53 |  s      115  163  0x73
(dc4)   20   24  0x14 |  4      52   64  0x34 |  T      84  124  0x54 |  t      116  164  0x74
(nak)   21   25  0x15 |  5      53   65  0x35 |  U      85  125  0x55 |  u      117  165  0x75
(syn)   22   26  0x16 |  6      54   66  0x36 |  V      86  126  0x56 |  v      118  166  0x76
(etb)   23   27  0x17 |  7      55   67  0x37 |  W      87  127  0x57 |  w      119  167  0x77
(can)   24   30  0x18 |  8      56   70  0x38 |  X      88  130  0x58 |  x      120  170  0x78
(em)    25   31  0x19 |  9      57   71  0x39 |  Y      89  131  0x59 |  y      121  171  0x79
(sub)   26   32  0x1a |  :      58   72  0x3a |  Z      90  132  0x5a |  z      122  172  0x7a
(esc)   27   33  0x1b |  ;      59   73  0x3b |  [      91  133  0x5b |  {      123  173  0x7b
(fs)    28   34  0x1c |  <      60   74  0x3c |  \      92  134  0x5c |  |      124  174  0x7c
(gs)    29   35  0x1d |  =      61   75  0x3d |  ]      93  135  0x5d |  }      125  175  0x7d
(rs)    30   36  0x1e |  >      62   76  0x3e |  ^      94  136  0x5e |  ~      126  176  0x7e
(us)    31   37  0x1f |  ?      63   77  0x3f |  _      95  137  0x5f |  (del)  127  177  0x7f
This is purely the Latin/English alphabet. There are no basic European characters (vowels with accents, graves, umlauts, etc.). Also, the lowest 32 numbers are designated as special 'control characters'. ASCII was designed in the days of tele-types and output printer terminals, and these codes were used to control the way the printer worked (e.g. (cr) = 0x0d = 'carriage return' = 'return the print carriage to the left side of the page'). Only a few of these are used today, namely 'carriage return (cr)', 'line feed (lf)' and 'horizontal tab (ht)'. These are generally used only in files to tell software that reads the files where lines of data actually break, rather than to affect the physical output of a printer.
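The mapping is easy to see from a program. Here is a minimal C sketch that prints the number the system's character set assigns to each character of a string; the output shown assumes an ASCII-compatible system:

    #include <stdio.h>

    int main(void)
    {
        /* In C a char is just a small integer; printing it as a number
           reveals the value the character set assigns to that character. */
        const char *word = "Hi!";
        for (const char *p = word; *p != '\0'; p++)
            printf("'%c' -> %d (0x%02x)\n", *p, (unsigned char)*p, (unsigned char)*p);
        return 0;   /* prints: 'H' -> 72 (0x48), 'i' -> 105 (0x69), '!' -> 33 (0x21) */
    }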
ASCII dates from a time when quite a few systems had 7-bit bytes; such systems are very rare today, and almost all machines deal exclusively in 8-bit bytes. Because of this, almost all extensions to ASCII involve using the high 8th bit for various 'tricks' to extend the range of possible characters.
In general we can break these extensions down into three types:
1. Single-byte character sets (SBCS), in which every character is represented by a single 8-bit byte.
2. Multi-byte character sets (MBCS), in which a single byte might act as an 'escape' value, so several further bytes might be required to determine the entire character being represented.
3. Mode-switching ('moded') character sets, in which a special 'escape sequence' is used to switch between modes, and the interpretation of byte values is completely changed by what mode you are in.
By their nature, SBCS systems are easier for most programs to deal with, because:
1. The number of characters in a string is simply the number of bytes.
2. Finding the nth character is a simple offset calculation.
3. Moving forwards or backwards through a string means moving one byte at a time.
For MBCS systems, each of these is considerably more complex, as the sketch below illustrates.
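A small C sketch of the first point, using UTF-8 (a multi-byte encoding discussed later) for the non-ASCII character:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* "é" encoded in UTF-8 is the two bytes 0xC3 0xA9, so the
           4-character word "café" occupies 5 bytes. */
        const char cafe[] = "caf\xC3\xA9";

        /* A byte-oriented routine such as strlen counts bytes, not characters;
           an MBCS-aware routine is needed to report 4. */
        printf("strlen = %zu bytes\n", strlen(cafe));   /* prints 5 */
        return 0;
    }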
Moded systems are particularly difficult to deal with, because you have to parse from the start of the data: you can't tell how to interpret the bytes you see without knowing what the last escape sequence was. The most common moded character set is iso-2022-jp for Japanese. It is also particularly interesting in that it uses only 7-bit values; the escape sequences tell you whether to interpret the data that follows as ASCII or as Japanese.
For almost all other character encoding systems, all characters in the 0-127 range are their equivalent ASCII counterparts. This means that in most cases, a plain ASCII string is also a valid string in just about every character set.
A common family of character sets in wide use is the ISO-8859-X series. These are all single-byte character sets, and each one is designed for a particular language or group of languages (note that the source this list was taken from is fairly old and out of date):
8859-1     Europe, Latin America
8859-2     Eastern Europe
8859-3     SE Europe/miscellaneous (Esperanto, Maltese, etc.)
8859-4     Scandinavia/Baltic (mostly covered by 8859-1 also)
8859-5     Cyrillic
8859-6     Arabic
8859-7     Greek
8859-8     Hebrew
8859-9     Latin5, same as 8859-1 except for Turkish instead of Icelandic
8859-10    Latin6, for Lappish/Nordic/Eskimo languages
Because these are all single-byte character sets, there is no ISO-8859 character set for languages with very large character repertoires (e.g. the Asian languages – Japanese, Korean, etc.).
Because of the nature of Japanese, Korean, Chinese, Vietnamese, etc. (collectively often called CJKV), these languages need a large range of values in which to represent all their characters. In general 1 byte (256 values) is not nearly enough, so the encoding schemes for these languages are all MBCS. The best-known examples are 'Big5', an encoding scheme for Chinese characters, and Shift-JIS for Japanese characters.
In general, each encoding is designed with a particular language in mind. One problem with the ISO-8859-X series is that you can’t tell which one is being used without being explicitly told. Given a file, does it have 8859-6 Arabic characters or 8859-7 Greek characters? Also what if you want Arabic and Greek characters in the same string or file, let alone 8859-6 Arabic characters and Big5 Chinese characters?
The attempted solution to these issues is Unicode. The following extract from the Unicode web site gives a good summary of the aim and of the situation it addresses:
Fundamentally, computers just deal with numbers. They store letters and other characters by assigning a number for each one. Before Unicode was invented, there were hundreds of different encoding systems for assigning these numbers. No single encoding could contain enough characters: for example, the European Union alone requires several different encodings to cover all its languages. Even for a single language like English no single encoding was adequate for all the letters, punctuation, and technical symbols in common use.
These encoding systems also conflict with one another. That is, two encodings can use the same number for two different characters, or use different numbers for the same character. Any given computer (especially servers) needs to support many different encodings; yet whenever data is passed between different encodings or platforms, that data always runs the risk of corruption.
Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.
Thus Unicode can be considered the ultimate character set, encompassing every character in every language. Unicode has a huge amount of support (from companies and in software infrastructure) and basically can be considered the standard way forward to deal with character set encoding issues.
Unicode was designed to be a clean sweep, supporting a nice clean mapping from the start. As is usually the way, things didn't quite go as planned. It was initially thought that 65,536 characters would be enough to encompass all characters. Thus, initial Unicode implementations treated each character as an unsigned 16-bit integer, allowing a return to the '1 byte = 1 character' rule, now in the form '1 word = 1 character' for any language. This would again make it easy to find the number of characters in a string, move around within strings, and so on.
Unfortunately, after a while it was realised that 65,536 characters would not be enough, and Unicode ended up with several different encoding forms.
UTF-16 consists of a series of 16-bit unsigned integers (code units). The following ranges are defined:
- 0x0000 to 0xD7FF: the code point itself
- 0xD800 to 0xDBFF: high (leading) surrogates
- 0xDC00 to 0xDFFF: low (trailing) surrogates
- 0xE000 to 0xFFFF: the code point itself
The surrogates are the interesting part: they allow the representation of code points greater than 0xFFFF using two 16-bit values. A value in the 0xD800 to 0xDBFF range must ALWAYS be followed by a value in the 0xDC00 to 0xDFFF range. This makes it easy to scan a string for a high or low surrogate and either look 1 word forward or 1 word backward to find the corresponding surrogate.
Looking at the surrogate values we see that the range 0xD800 -> 0xDBFF has 1024 values, and ditto for 0xDC00 -> 0xDFFF. Thus the surrogate pairs can represent 1024 * 1024 possible combinations, which covers the 1,048,576 supplementary Unicode code points (0x10000 to 0x10FFFF).
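A minimal C sketch of how that packing works (the function name is just illustrative):

    #include <stdio.h>
    #include <stdint.h>

    /* Encode a Unicode code point above 0xFFFF as a UTF-16 surrogate pair.
       cp must be in the range 0x10000..0x10FFFF. */
    static void to_surrogate_pair(uint32_t cp, uint16_t *high, uint16_t *low)
    {
        cp -= 0x10000;                            /* 20 significant bits remain    */
        *high = 0xD800 + (uint16_t)(cp >> 10);    /* top 10 bits -> high surrogate */
        *low  = 0xDC00 + (uint16_t)(cp & 0x3FF);  /* low 10 bits -> low surrogate  */
    }

    int main(void)
    {
        uint16_t hi, lo;
        to_surrogate_pair(0x1D11E, &hi, &lo);          /* U+1D11E, musical G clef */
        printf("U+1D11E -> 0x%04X 0x%04X\n", hi, lo);  /* prints 0xD834 0xDD1E    */
        return 0;
    }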
UTF-8 is the most complex of the Unicode encodings, but it is also commonly used in the Unix world and with XML and HTML. The reason for this is that many existing programs that deal with strings only in an opaque way (e.g. just move around this 'NULL-terminated array of 8-bit chars') will work fine with UTF-8. There's no need to change to using word pointers or the like.
As an added advantage, UTF-8 is backwards compatible with ASCII as well. That is, all byte values < 0x80 represent ASCII characters, so you can transition your code internally from ASCII to UTF-8 without having to make the change in one big hit.
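The encoding itself packs each code point into 1 to 4 bytes, with the lead byte indicating how many trailing bytes follow. A minimal C sketch of an encoder (no validation of surrogate values or other ill-formed input):

    #include <stdio.h>
    #include <stdint.h>

    /* Encode one Unicode code point (0..0x10FFFF) as UTF-8.
       Returns the number of bytes written to out (1..4). */
    static int utf8_encode(uint32_t cp, unsigned char out[4])
    {
        if (cp < 0x80) {                      /* 1 byte: plain ASCII         */
            out[0] = (unsigned char)cp;
            return 1;
        } else if (cp < 0x800) {              /* 2 bytes: 110xxxxx 10xxxxxx  */
            out[0] = 0xC0 | (cp >> 6);
            out[1] = 0x80 | (cp & 0x3F);
            return 2;
        } else if (cp < 0x10000) {            /* 3 bytes                     */
            out[0] = 0xE0 | (cp >> 12);
            out[1] = 0x80 | ((cp >> 6) & 0x3F);
            out[2] = 0x80 | (cp & 0x3F);
            return 3;
        } else {                              /* 4 bytes                     */
            out[0] = 0xF0 | (cp >> 18);
            out[1] = 0x80 | ((cp >> 12) & 0x3F);
            out[2] = 0x80 | ((cp >> 6) & 0x3F);
            out[3] = 0x80 | (cp & 0x3F);
            return 4;
        }
    }

    int main(void)
    {
        unsigned char buf[4];
        int n = utf8_encode(0x00E9, buf);     /* U+00E9, e with acute accent */
        for (int i = 0; i < n; i++)
            printf("0x%02X ", buf[i]);        /* prints 0xC3 0xA9 */
        printf("\n");
        return 0;
    }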
You'll probably also encounter the terms UCS-2 and UCS-4 with regard to Unicode encoding. UCS-2 is just UTF-16 without the surrogate pairs, and UCS-4 is, as far as I can tell, the same as UTF-32.
The following really good summary is copied directly from:
http://www-106.ibm.com/developerworks/library/utfencodingforms/
Programming using UTF-8 and UTF-16 is much more straightforward than with other mixed-width character encodings. For each code point, they have either a singleton form or a multi-unit form. With UTF-16, there is only one multi-unit form, having exactly two code units. With UTF-8, the number of trailing units is determined by the value of the lead unit: thus you can't have the same lead unit with a different number of trailing units. Within each encoding form, the values for singletons, for lead units, and for trailing units are all completely disjoint. This has crucial implications for implementations:
- No overlap.
If you search for string A in a string B, you will never get a false match on code points. You never need to convert to code points for string searching. False matches never occur because the end of one sequence can never be the same as the start of another sequence. Overlap is one of the biggest problems with common multi-byte encodings like Shift-JIS. All of the UTFs avoid this problem.
- Determinate boundaries.
If you randomly access into text, you can always determine the nearest code-point boundaries with a small number of machine instructions.
- Pass-through.
Processes that don't look at particular character values don't need to know about the internal structure of the text.
- Simple iteration.
Getting the next or previous code point is straightforward, and only takes a small number of machine instructions.
- Slow indexing.
Except in UTF-32, it is inefficient to find code unit boundaries corresponding to the nth code point, or to find the code point offset containing the nth code unit. Both involve scanning from the start of the text.
- Frequency.
Because the proportion of world text that needs surrogate space is extremely small, UTF-16 code should always be optimized for the single code unit. With UTF-8, it is probably worth optimizing for the single-unit case also, but not if it slows down the multi-unit case appreciably.
UTF-8 has one additional complication, called the shortest form requirement. Of the possible sequences in Table 1 that could represent a code point, the Unicode Standard requires that the shortest possible sequence be generated. When mapping back from code units to code points, however, implementations are not required to check for the shortest form. This problem does not occur in UTF-16.
Most systems will be upgrading their UTF-16 support for surrogates in the next year or two. This upgrade can use a phased approach. From a market standpoint, the only interesting surrogate-space characters expected in the near term are an additional set of CJK ideographs used for Japan, China, and Korea. If a system is already internationalized, most of the operations on the system will work sufficiently well that minor changes will suffice for these in the near term.
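The 'determinate boundaries' and 'simple iteration' points above are easy to see for UTF-8: every trailing byte has the bit pattern 10xxxxxx, so from any byte offset you can back up to a code-point boundary. A small C sketch (the function name is just illustrative):

    #include <stdio.h>

    /* Return the offset of the nearest code-point boundary at or before pos,
       by skipping backwards over UTF-8 trailing bytes (10xxxxxx). */
    static size_t utf8_boundary_before(const unsigned char *s, size_t pos)
    {
        while (pos > 0 && (s[pos] & 0xC0) == 0x80)
            pos--;
        return pos;
    }

    int main(void)
    {
        const unsigned char text[] = "caf\xC3\xA9!";   /* "café!" in UTF-8 */
        /* Offset 4 lands on the trailing byte 0xA9; the boundary is at 3. */
        printf("boundary: %zu\n", utf8_boundary_before(text, 4));   /* prints 3 */
        return 0;
    }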
Probably the two hardest problems with Unicode strings are normalisation and collation.
Roughly speaking, normalisation is required because Unicode sometimes allows multiple representations of a single string. For instance, there is a code point for 'á' (the letter 'a' with an acute accent), but there is also a code point for 'a' and a code point for a combining '´' accent character. Thus the single code point á and the two code points a + ´ together should be considered the same character for comparison purposes.
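A minimal C sketch of the problem, comparing the two UTF-8 spellings byte-for-byte:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* Two spellings of "á" in UTF-8:
           precomposed U+00E1         -> bytes 0xC3 0xA1
           decomposed  U+0061 U+0301  -> bytes 0x61 0xCC 0x81 */
        const char nfc[] = "\xC3\xA1";
        const char nfd[] = "a\xCC\x81";

        /* A naive byte comparison says the strings differ even though a user
           sees the same character; this is why normalisation is needed. */
        printf("bitwise equal? %s\n", strcmp(nfc, nfd) == 0 ? "yes" : "no");  /* no */
        return 0;
    }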
The following discussion of multiple representations and normalisation is taken from:
http://www-106.ibm.com/developerworks/library/internationalization-support.html
Multiple representations
Unicode includes the concept of a combining character, which is a character that (generally) modifies the character before it in some way rather than showing up as a character on its own. For example, the letter é can be represented using a regular letter e followed by a combining acute-accent character. This occupies two storage positions in memory. Unicode defines these characters to allow flexibility in its use.
If you need a certain type of accented character, Unicode can give you the base character and the accent rather than having to assign a whole new code-point value to the actual combination you want to display. This greatly expands the effective number of characters that Unicode can encode. In some cases, such as Korean, "characters" are broken up into smaller units that can be combined into the actual characters. This was done to save on code-point assignments.
In many cases, including the two examples above, Unicode actually does have a single code point representing the entire unit. The letter é can be represented using its own single code-point value, and all Korean syllables (including many that don't naturally occur in Korean) have single code-point values. This is because most constituencies prefer the precomposed versions. Storing é as two characters would both complicate processing and increase storage size. This introduces a significant pro-English bias into what's supposed to be an international standard. Similarly, requiring 6 bytes for each Korean syllable when only 2 bytes are required for each Japanese ideograph introduces an anti-Korean bias.
As a result, many characters in Unicode have multiple possible representations. In fact, many characters that don't have a specific code point in Unicode (for example, many letters with two diacritical marks on them) have multiple sequences of Unicode characters that can represent them. This can make two strings that appear to be the same to the user appear to be different to the computer.
The Unicode standard generally requires that implementations treat alternative "spellings" of the same sequence of characters as identical. Unfortunately, this can be impractical.
Most processes just want to do bitwise equality on two strings. Doing anything else imposes a huge overhead both in executable size and in performance. But because many of these processes can't control the source of the text, they're stuck. The traditional way of handling this is normalization -- picking a preferred representation for every character or sequence of characters that can be represented multiple ways in Unicode. The Unicode standard does this by declaring a preferred ordering for multiple accent marks on a single base character and by declaring a "canonical decomposition" into multiple characters for every single character that can also be represented as two or more characters.
Four normalization forms
The Unicode standard also defines a set of "compatibility decompositions" for characters that can be decomposed into other characters but only with the loss of some information. Newer versions of the standard also define "compatibility compositions" and "canonical compositions." This actually gives you a choice of four "normalized forms" for a string.
A program can make bitwise equality comparison work right by declaring that all strings must be in a particular normalized form. The program has the choice of requiring that all text fed to it be normalized in order to work right (delegating the work to an outside entity), or normalizing things itself. The World Wide Web Consortium ran into this very problem and solved it by requiring all applications that produce text on the Internet to produce it in normalized form. Software that merely transfers text from one place to another or displays it can choose to normalize it again on receipt, but doesn't have to -- if everybody's followed the rules, it's already normalized.
Collation has to do with the ordering of strings. In English and most Western languages we understand well that the alphabet has an order, such that B comes before C but after A. However, even within that there are more difficult questions:
1. Are ‘a’ and ‘A’ equivalent?
2. What about á and a?
3. How do we order symbols?
Typically, ordering has been a binary comparison of ASCII text, or of ASCII text first converted entirely to one case (lower or upper).
Obviously this becomes much harder for Unicode. How do you give a consistent ordering to every possible character in all languages? The Unicode solution is given in this document, the Unicode Collation Algorithm. http://www.unicode.org/unicode/reports/tr10/
The general process for ordering Unicode strings is a bit different from the usual ASCII ordering method. Basically with ASCII you can do a byte by byte comparison. Because of the complexity of Unicode, the algorithm is designed as follows:
Briefly stated, the Unicode Collation Algorithm takes an input Unicode string and a Collation Element Table, containing mapping data for characters. It produces a sort key, which is an array of unsigned 16-bit integers. Two or more sort keys so produced can then be binary-compared to give the correct comparison between the strings for which they were generated.
1. For each string to compare, the algorithm generates an array of 16-bit word values
2. For each string, you can then do a binary compare on the generated word arrays to determine the relative ordering of two strings
The algorithm also specifies that you can have different 'levels' of ordering. For Latin script, the first three correspond roughly to:
1. alphabetic ordering (eg. AbC == abc, ábc == Abc)
2. diacritic ordering (eg. ábc != abc)
3. case ordering (eg Abc != abc)
The collation algorithm is reasonably complex, and you’ll probably want to use an already written library to do this for you.
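As a rough illustration of the 'generate a sort key, then binary-compare' idea (not the Unicode Collation Algorithm itself), the C library exposes the same shape through strxfrm and strcoll. The results depend on which locales are installed on the system, and the locale name below is an assumption:

    #include <stdio.h>
    #include <string.h>
    #include <locale.h>

    int main(void)
    {
        /* strxfrm builds a locale-dependent sort key; comparing the keys with
           strcmp gives the same ordering as strcoll on the original strings. */
        if (setlocale(LC_COLLATE, "en_US.UTF-8") == NULL)
            fprintf(stderr, "locale not available, falling back to \"C\"\n");

        const char *a = "cote";
        const char *b = "c\xC3\xB4t\xC3\xA9";   /* "côté" in UTF-8 */

        char ka[128], kb[128];
        strxfrm(ka, a, sizeof ka);
        strxfrm(kb, b, sizeof kb);

        /* The two results may differ in magnitude but always agree in sign. */
        printf("strcoll = %d, key compare = %d\n", strcoll(a, b), strcmp(ka, kb));
        return 0;
    }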
One of the best libraries for dealing with Unicode is IBM's International Components for Unicode (ICU) (http://oss.software.ibm.com/developerworks/opensource/icu/project/). This library is very extensive and covers much more than what I've discussed in this document (e.g. most of the items in point 2 at the start of this document: calendars, dates, times, currencies, numbers, line/word/sentence breaks, etc.).
With regard to what I've discussed so far, it includes routines for transcoding between different character sets/code pages, as well as for sorting strings based on a collation order.
Windows NT has native Unicode support. Unfortunately this effectively means UCS-2 support; UTF-16 (surrogate pair) support is limited (see this link http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/unicode_192r.asp). The Windows API does not support UTF-8 strings directly; it supports only ASCII or UCS-2 strings. Thus to use UTF-8 strings with the Windows API, you first have to convert the strings to UTF-16. You can do this using the calls WideCharToMultiByte and MultiByteToWideChar, both of which accept the code page 'CP_UTF8' for converting to/from UTF-8.
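A minimal sketch of that round trip (error handling mostly omitted; the buffer sizes are arbitrary for the example):

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        const char *utf8 = "caf\xC3\xA9";   /* "café" in UTF-8 */

        /* Ask how many UTF-16 code units are needed (including the terminator),
           then do the real conversion. */
        int wlen = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, NULL, 0);
        if (wlen <= 0 || wlen > 64)
            return 1;

        WCHAR wbuf[64];
        MultiByteToWideChar(CP_UTF8, 0, utf8, -1, wbuf, wlen);
        /* wbuf now holds UTF-16 and can be passed to the wide ("W") Windows APIs. */

        /* And back again: UTF-16 to UTF-8. */
        char u8buf[64];
        WideCharToMultiByte(CP_UTF8, 0, wbuf, -1, u8buf, sizeof u8buf, NULL, NULL);

        printf("round trip: %s\n", u8buf);
        return 0;
    }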
Lots of the included links above are worth reviewing first. Here are some more:
Adding internationalization support to the base standard for JavaScript
Lessons learned in internationalizing the ECMAScript standard
http://www-106.ibm.com/developerworks/library/internationalization-support.html