1.1 Character Sets
- A character set is a collection of characters (letters, symbols, numbers, etc.).
- Each character is represented by a number.
e.g. 65=A, 66=B, 67=C, ... 1234=Ӓ, ...
- Examples of character sets are:
ASCII, ISO 8859-1, Windows 1252
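The number-to-character mapping can be checked directly. A minimal Python sketch, using only the built-ins ord and chr (the variable names are illustrative, not from the notes above):

```python
# Look up the number behind a character, and the character behind a number.
code_for_a = ord("A")         # 65
char_for_66 = chr(66)         # "B"
char_for_1234 = chr(1234)     # Cyrillic capital A with diaeresis (U+04D2)

print(code_for_a, char_for_66, hex(ord(char_for_1234)))
```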
1.2 Character Encoding
- A character encoding is a means of representing a character set in a computer file.
- For the ASCII and Windows 1252 (or ANSI) character sets it's easy: 1 byte = 1 character.
- For large character sets, with more than 256 characters, it is more complex, as more than 1 byte per character is used.
- "UTF-8" uses 1, 2, 3 or 4 bytes per character.
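The byte counts can be seen by encoding a few sample characters. This is a sketch in Python; the sample characters and dictionary keys are chosen here purely for illustration.

```python
# Number of bytes each character occupies in UTF-8.
samples = {
    "A": "A",                # ASCII letter
    "pound": "\u00a3",       # pound sign
    "euro": "\u20ac",        # euro sign
    "emoji": "\U0001F600",   # a character outside the 16-bit range
}
utf8_lengths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}
print(utf8_lengths)

# Windows 1252 always uses exactly one byte, but only has 256 positions.
cp1252_euro = "\u20ac".encode("cp1252")
print(cp1252_euro)
```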
1.3 Character Entities in HTML and XML
- An entity reference is of the form:
&quot; &euro; &copy;
- A numeric reference (in HTML and XML) for character 255 is of the form:
&#255; (decimal) or &#xFF; (hexadecimal)
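As a rough illustration, Python's standard html module can resolve both kinds of reference. The helper to_numeric_refs is a hypothetical name introduced here; it is one simple way to produce decimal references, not the only one.

```python
import html

# Named entity references resolve to single characters.
named = html.unescape("&quot; &euro; &copy;")

# Decimal and hexadecimal numeric references for character 255.
decimal_ref = html.unescape("&#255;")
hex_ref = html.unescape("&#xff;")

def to_numeric_refs(text):
    """Replace every non-ASCII character with a decimal numeric reference."""
    return "".join(c if ord(c) < 128 else f"&#{ord(c)};" for c in text)

escaped = to_numeric_refs("caf\u00e9 \u20ac5")
print(escaped)
```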
2.1 ASCII
- ASCII is the original character set, with 128 characters defined.
- 1 byte = 1 character.
2.2 ISO 8859-1
- This is the ISO "Western European" character set.
- It is the original "web" character set, and used as the default by older browsers.
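- ISO 8859-1 is a subset of the larger UCS/Unicode character set: its 256 codes are the first 256 Unicode code points.
- It uses the same character numbers as Unicode (for codes #0 to #255), but a different character encoding from UTF-8: codes #0 to #127 are stored identically, while codes #128 to #255 take 1 byte in ISO 8859-1 and 2 bytes in UTF-8.
- It is now deprecated (obsolete); use UTF-8 instead.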
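The relationship between the two can be checked with a short Python sketch (variable names are illustrative). It relies on Python's built-in "iso-8859-1" codec, which maps every byte value to the code point with the same number.

```python
# Every ISO 8859-1 byte decodes to the Unicode code point with the same number.
all_bytes = bytes(range(256))
same_numbers = all(
    ord(ch) == i for i, ch in enumerate(all_bytes.decode("iso-8859-1"))
)

# Same character number, different stored bytes.
e_acute = "\u00e9"                          # e acute, code 233
as_latin1 = e_acute.encode("iso-8859-1")    # 1 byte
as_utf8 = e_acute.encode("utf-8")           # 2 bytes
print(same_numbers, as_latin1, as_utf8)
```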
2.3 Unicode and UCS (Universal Character Set)
- This is a very large character set. It is a combination of the ISO 8859-1 characters,
plus mathematical and other symbols,
plus the Chinese, Hebrew, Japanese, Greek, Thai, Persian and other writing systems.
- Some special cases :
- there are spaces reserved for 'user defined' characters,
- some characters can be combined to make composite characters (e.g. e and an accent to make e acute is 2 characters in the file, but 1 on the screen)
- Unicode/UCS is a character set. It is most commonly encoded using UTF-8 (UTF-16 and UTF-32 are alternative encodings).
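The composite-character case can be sketched in Python with the standard unicodedata module. NFC normalization is used here as one way to merge the pair into a single character; whether to normalize at all depends on the application.

```python
import unicodedata

decomposed = "e\u0301"    # "e" followed by COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)

# Two characters in the file before composing, one afterwards.
lengths = (len(decomposed), len(composed))
utf8_sizes = (len(decomposed.encode("utf-8")), len(composed.encode("utf-8")))
print(lengths, utf8_sizes)
```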
2.4 Windows 1252 / ANSI Character Set
- This is the Windows character set.
- It is encoded using 1 byte (0-255) per character.
- From 0-127, it's the same as ASCII.
- Between 0x80 and 0x9F there are differences. This is the problem area: Windows 1252 puts printable characters here (e.g. the euro sign and curly quotes), whereas in ISO 8859-1 and Unicode these positions hold only non-printing control codes.
- From 0xA0 and above, it's the same as ISO 8859-1.
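A small Python sketch of the three ranges, using the built-in "cp1252", "iso-8859-1" and "ascii" codecs (variable names are illustrative):

```python
# Byte 0x80 means different things in the two character sets.
euro_byte = b"\x80"
in_cp1252 = euro_byte.decode("cp1252")        # the euro sign, U+20AC
in_latin1 = euro_byte.decode("iso-8859-1")    # a control code, U+0080

# From 0xA0 upwards the two agree.
high_bytes = bytes(range(0xA0, 0x100))
same_above_a0 = high_bytes.decode("cp1252") == high_bytes.decode("iso-8859-1")

# From 0 to 127 Windows 1252 matches ASCII.
low_bytes = bytes(range(0x80))
same_as_ascii = low_bytes.decode("cp1252") == low_bytes.decode("ascii")
print(same_above_a0, same_as_ascii)
```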
2.5 UTF-8
- UTF-8 is actually a character encoding, not a character set. Colloquially, it is now used to mean "Unicode/UCS with the UTF-8 encoding".
- It's a means of using 1, 2, 3 or 4 bytes to store a very large character set.
- ASCII characters (0-127) take up 1 byte, so it's backwards compatible.
- £, maths symbols, Chinese and Japanese characters take up 2, 3 or 4 bytes.
- Some Windows editors write a 'BOM' (byte order mark), a marker at the front of a file to indicate that the file contains UTF-8 encoded characters. It is the character U+FEFF, stored as the 3 bytes 0xEF 0xBB 0xBF. It is valid UTF-8, but the Unicode standard neither requires nor recommends it, and some software does not expect it.
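The UTF-8 BOM is the three bytes 0xEF 0xBB 0xBF. A short Python sketch of how it can be detected and removed, using the standard codecs module and the built-in "utf-8-sig" codec (the sample data is made up for illustration):

```python
import codecs

bom = codecs.BOM_UTF8                      # the three BOM bytes
data = bom + "hello".encode("utf-8")

# Plain "utf-8" keeps the BOM as a leading U+FEFF character.
kept = data.decode("utf-8")
# "utf-8-sig" strips the BOM if it is present.
stripped = data.decode("utf-8-sig")
print(repr(kept), repr(stripped))
```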
Whatever you've used in the past, UTF-8 (Unicode/UCS) is the thing to aim for. Google, Blogger etc. all use it.
4.1 From Windows 1252 / ANSI to the ISO character sets
- If converting text from a Windows file to a web page in ISO format, you may have to map some 'high byte' characters (0x80 to 0x9F), e.g. the euro symbol, as the character numbers will not be the same. ISO 8859-1 has no euro symbol at all, so there it must be written as a numeric reference or replaced.
- If copying and pasting, Windows will take care of the conversion for you.
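One possible way to do the mapping in Python is to decode the Windows bytes and re-encode them with the "xmlcharrefreplace" error handler, which writes a numeric reference for anything ISO 8859-1 cannot hold. This is a sketch that assumes the output is destined for HTML or XML; the sample input is invented.

```python
# Windows 1252 input: euro sign (0x80) and curly double quotes (0x93, 0x94).
windows_bytes = b"Price: \x8010 \x93quoted\x94"

text = windows_bytes.decode("cp1252")

# Characters with no ISO 8859-1 equivalent become numeric references.
iso_bytes = text.encode("iso-8859-1", errors="xmlcharrefreplace")
print(iso_bytes)
```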
4.2 From ISO 8859-1 to UTF-8/UCS/Unicode
- Viewing an ISO 8859-1 file in a web browser page set to UTF-8 will display any characters numbered 128 or above as illegal characters.
- In terms of character sets, the conversion is straightforward, as every ISO 8859-1 character has the same number in Unicode.
- The character encoding is the problem. Example: In ISO 8859-1, character(165) is stored as the single byte 165 (0xA5). In UTF-8, it should be 2 bytes (0xC2 0xA5). The single byte on its own is an illegal UTF-8 sequence.
- The solution is programming language dependent or editor dependent.
- For example, in the Notepad++ editor, there is a 'convert ANSI to UTF-8' option.
- In Perl (for a byte string): $string =~ s/ ( [\x80-\xff] ) / chr(0xC0 | (ord($1) >> 6)) . chr(0x80 | (ord($1) & 0x3F)) /gxe;
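For comparison, a Python sketch of the same conversion. It decodes the bytes with their real source encoding and re-encodes them, and first shows why the raw bytes are rejected as UTF-8. The sample bytes are made up for illustration.

```python
iso_bytes = b"\xa5100"    # character 165 (yen sign) followed by "100"

# Read as UTF-8, the lone byte 0xA5 is invalid.
try:
    iso_bytes.decode("utf-8")
    valid_as_utf8 = True
except UnicodeDecodeError:
    valid_as_utf8 = False

# Decode with the real source encoding, then re-encode as UTF-8.
utf8_bytes = iso_bytes.decode("iso-8859-1").encode("utf-8")
print(valid_as_utf8, utf8_bytes)
```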
4.5 UTF-8 to ISO 8859-1
- Viewing a UTF-8 file in a web browser page set to ISO 8859-1 will display 2 (or more) characters for each UTF-8 'high byte' character.
e.g. For the 2 byte UTF-8 characters numbered 128 to 255, it will display an unwanted Â or Ã followed by a second character: £ shows as Â£, and é shows as Ã©.
- The solution: First, identify all characters in your input stream that don't have ISO 8859-1 equivalents.
- Maybe convert
- all the exotic UTF-8 bullet points to &#nn;
- the exotic hyphens to - (minus sign)
- the various 6, 66, 9, 99 style quotes to ' and "
- For XML feeds with character codes greater than 255, consider using &#nn; escape sequences (rather than &name; or the binary code, both of which will cause problems)
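The steps above might be sketched in Python as follows. REPLACEMENTS and utf8_to_iso are hypothetical names introduced for this example, and the replacement table is deliberately small; a real feed would need a table tuned to its own input.

```python
# Hypothetical replacement table; extend it to suit the input.
REPLACEMENTS = {
    "\u2018": "'", "\u2019": "'",                  # 6 and 9 style single quotes
    "\u201c": '"', "\u201d": '"',                  # 66 and 99 style double quotes
    "\u2010": "-", "\u2013": "-", "\u2014": "-",   # hyphen, en dash, em dash
}

def utf8_to_iso(data):
    """Convert UTF-8 bytes to ISO 8859-1 bytes for an HTML or XML target."""
    text = data.decode("utf-8").translate(str.maketrans(REPLACEMENTS))
    # Anything still outside ISO 8859-1 (e.g. bullets) becomes &#nn;
    return text.encode("iso-8859-1", errors="xmlcharrefreplace")

source = "\u201cCaf\u00e9\u201d \u2013 \u2022 item".encode("utf-8")
converted = utf8_to_iso(source)
print(converted)

# What an ISO 8859-1 viewer shows for the unconverted UTF-8 bytes of e acute.
mojibake = "\u00e9".encode("utf-8").decode("iso-8859-1")
```

Characters that do exist in ISO 8859-1, such as e acute, pass through as single bytes; only the leftovers are written as numeric references.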
A well known international newspaper has a publishing system that uses UTF-8, and a series of XML feeds that use ISO 8859-1
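These character positions (0x80 to 0x9f) are printable characters only in Windows 1252. In ISO 8859-1 and Unicode the same numbers are non-printing control codes, and as single bytes they are illegal in UTF-8.
In practice, you may wish to map characters like ‘ and ’ style quotes to ' etc.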
Byte Unicode ;Name
0x80 0x20ac ;Euro Sign
0x81 0x0081
0x82 0x201a ;Single Low-9 Quotation Mark
0x83 0x0192 ;Latin Small Letter F With Hook
0x84 0x201e ;Double Low-9 Quotation Mark
0x85 0x2026 ;Horizontal Ellipsis
0x86 0x2020 ;Dagger
0x87 0x2021 ;Double Dagger
0x88 0x02c6 ;Modifier Letter Circumflex Accent
0x89 0x2030 ;Per Mille Sign
0x8a 0x0160 ;Latin Capital Letter S With Caron
0x8b 0x2039 ;Single Left-Pointing Angle Quotation Mark
0x8c 0x0152 ;Latin Capital Ligature Oe
0x8d 0x008d
0x8e 0x017d ;Latin Capital Letter Z With Caron
0x8f 0x008f
0x90 0x0090
0x91 0x2018 ;Left Single Quotation Mark
0x92 0x2019 ;Right Single Quotation Mark
0x93 0x201c ;Left Double Quotation Mark
0x94 0x201d ;Right Double Quotation Mark
0x95 0x2022 ;Bullet
0x96 0x2013 ;En Dash
0x97 0x2014 ;Em Dash
0x98 0x02dc ;Small Tilde
0x99 0x2122 ;Trade Mark Sign
0x9a 0x0161 ;Latin Small Letter S With Caron
0x9b 0x203a ;Single Right-Pointing Angle Quotation Mark
0x9c 0x0153 ;Latin Small Ligature Oe
0x9d 0x009d
0x9e 0x017e ;Latin Small Letter Z With Caron
0x9f 0x0178 ;Latin Capital Letter Y With Diaeresis
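A table like this can be regenerated from Python's built-in "cp1252" codec, which is a convenient cross-check. The function name cp1252_high_table is hypothetical. The sketch assumes that the positions Python's codec leaves unassigned should map to the code point with the same number, as the table above does.

```python
import unicodedata

def cp1252_high_table():
    """Yield (byte, code point, name) for Windows 1252 positions 0x80-0x9F."""
    for byte in range(0x80, 0xA0):
        try:
            ch = bytes([byte]).decode("cp1252")
        except UnicodeDecodeError:
            ch = chr(byte)    # unassigned position: keep the same number
        yield byte, ord(ch), unicodedata.name(ch, "")

rows = list(cp1252_high_table())
for byte, code_point, name in rows:
    print(f"0x{byte:02x} 0x{code_point:04x} ;{name.title()}")
```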