Cablechip Solutions

web development with Unix, Perl, Javascript, HTML and web services


Character Sets and Character Encodings : UTF-8 and ISO8859-1


1. Definitions

1.1 Character Sets

1.2 Character Encoding

1.3 Character Entities in HTML and XML

2. Character Sets

2.1 ASCII

2.2 ISO 8859-1

2.3 Unicode and UCS (Universal Character Set)

2.4 Windows 1252 / ANSI Character Set

3 Character Encoding

3.1 UTF-8

4 Character Set Conversion Problems

Whatever you've used in the past, UTF-8 (Unicode/UCS) is the thing to aim for. Google, Blogger and other major services all use it.

4.1 From Windows 1252 / ANSI to the ISO character sets

4.2 From ISO 8859-1 to UTF-8/UCS/Unicode

4.3 From UTF-8 to ISO 8859-1

5 Case Study

A well-known international newspaper has a publishing system that uses UTF-8 and a series of XML feeds that use ISO 8859-1. The conversion from one to the other can be handled as follows:

  1. Analyse an entire year's worth of newspaper articles.
    • Make a list of every unique character used.
    • Cater for &#nn; and &name; style characters.
  2. Map all the UTF-8 characters found (with a character code greater than 127) to ISO 8859-1 equivalents.
  3. Flag up any UTF-8 characters encountered in the conversion process which are not covered by this mapping.
    • Again, cater for &#nn; and &name; style characters.
  4. Escape all characters with a code greater than 127 using the XML &#nn; escape sequence, so the output file is pure ASCII.
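
The steps above can be sketched in Python roughly as follows. This is a minimal illustration rather than the newspaper's actual code: the FALLBACKS table and the function names are hypothetical, and a real mapping would be built from the survey in step 1. References that resolve to ASCII (such as &amp;) are deliberately left alone so the XML markup stays intact.

```python
import re
from html.entities import name2codepoint

# Hypothetical mapping for characters that have no ISO 8859-1 equivalent.
FALLBACKS = {
    "\u2018": "'", "\u2019": "'",   # single curly quotes
    "\u201c": '"', "\u201d": '"',   # double curly quotes
    "\u2013": "-", "\u2014": "-",   # en dash, em dash
    "\u2026": "...",                # horizontal ellipsis
}

# Matches &#nn;, &#xhh; and &name; style references.
REFERENCE = re.compile(r"&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")

def resolve_references(text):
    """Turn non-ASCII character references into the characters themselves."""
    def repl(match):
        body = match.group(1)
        if body[:2] in ("#x", "#X"):
            code = int(body[2:], 16)
        elif body.startswith("#"):
            code = int(body[1:])
        else:
            code = name2codepoint.get(body, 0)
        # Leave ASCII and unknown references untouched (e.g. &amp;, &lt;).
        return chr(code) if code > 127 else match.group(0)
    return REFERENCE.sub(repl, text)

def unique_characters(text):
    """Step 1: list every distinct non-ASCII character, references included."""
    return sorted({ch for ch in resolve_references(text) if ord(ch) > 127})

def convert(text):
    """Steps 2 to 4: map, flag unmapped characters, escape to pure ASCII."""
    out, unmapped = [], set()
    for ch in resolve_references(text):
        if ord(ch) > 255:                 # outside ISO 8859-1
            if ch in FALLBACKS:
                out.append(FALLBACKS[ch])
                continue
            unmapped.add(ch)              # step 3: flag for review
        out.append(ch if ord(ch) < 128 else "&#%d;" % ord(ch))
    return "".join(out), unmapped
```

Calling convert() on an article returns the ASCII-only text together with the set of characters that still need an entry in the mapping.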

6 Appendix : Differences between Windows 1252 and the ISO Character Sets

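Windows 1252 assigns printable characters to the positions 0x80 to 0x9f. ISO 8859-1 leaves these positions undefined, and Unicode reserves the matching code points (U+0080 to U+009F) for control characters, so text containing them has to be remapped. Each line of the table below gives the Windows 1252 byte, the corresponding Unicode code point and the character name; lines without a name are positions that Windows 1252 itself leaves unassigned.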

In practice, you may wish to map characters such as the ‘ and ’ style quotes to a plain ', and so on.

0x80 0x20ac ;Euro Sign
0x81 0x0081
0x82 0x201a ;Single Low-9 Quotation Mark
0x83 0x0192 ;Latin Small Letter F With Hook
0x84 0x201e ;Double Low-9 Quotation Mark
0x85 0x2026 ;Horizontal Ellipsis
0x86 0x2020 ;Dagger
0x87 0x2021 ;Double Dagger
0x88 0x02c6 ;Modifier Letter Circumflex Accent
0x89 0x2030 ;Per Mille Sign
0x8a 0x0160 ;Latin Capital Letter S With Caron
0x8b 0x2039 ;Single Left-Pointing Angle Quotation Mark
0x8c 0x0152 ;Latin Capital Ligature Oe
0x8d 0x008d
0x8e 0x017d ;Latin Capital Letter Z With Caron
0x8f 0x008f
0x90 0x0090
0x91 0x2018 ;Left Single Quotation Mark
0x92 0x2019 ;Right Single Quotation Mark
0x93 0x201c ;Left Double Quotation Mark
0x94 0x201d ;Right Double Quotation Mark
0x95 0x2022 ;Bullet
0x96 0x2013 ;En Dash
0x97 0x2014 ;Em Dash
0x98 0x02dc ;Small Tilde
0x99 0x2122 ;Trade Mark Sign
0x9a 0x0161 ;Latin Small Letter S With Caron
0x9b 0x203a ;Single Right-Pointing Angle Quotation Mark
0x9c 0x0153 ;Latin Small Ligature Oe
0x9d 0x009d
0x9e 0x017e ;Latin Small Letter Z With Caron
0x9f 0x0178 ;Latin Capital Letter Y With Diaeresis
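
As a rough illustration of how this table is used, the Python sketch below repairs text in which Windows 1252 bytes were read as ISO 8859-1, leaving stray characters in the range U+0080 to U+009F. It relies on Python's built-in cp1252 codec rather than a hand-typed copy of the table; the function name c1_to_cp1252 and the PLAIN_QUOTES mapping are hypothetical names chosen for this example.

```python
def c1_to_cp1252(text):
    """Replace characters U+0080..U+009F with their Windows 1252 meaning."""
    out = []
    for ch in text:
        if 0x80 <= ord(ch) <= 0x9f:
            try:
                ch = bytes([ord(ch)]).decode("cp1252")
            except UnicodeDecodeError:
                # 0x81, 0x8d, 0x8f, 0x90 and 0x9d are unassigned in
                # Windows 1252, so the character is left unchanged.
                pass
        out.append(ch)
    return "".join(out)

# Optional second pass: flatten curly quotes to plain ASCII quotes.
PLAIN_QUOTES = str.maketrans({
    "\u2018": "'", "\u2019": "'",
    "\u201c": '"', "\u201d": '"',
})

# Bytes written by a Windows 1252 application, misread as ISO 8859-1.
misread = b"\x93Hello\x94 \x80 5".decode("latin-1")
repaired = c1_to_cp1252(misread)
plain = repaired.translate(PLAIN_QUOTES)
```

Whether to keep the typographic quotes or flatten them depends on what the target system can display.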