UTF-8 And Bidirectional Text

In the beginning there was ASCII, which supported the English alphabet. Standard ASCII characters are 7 bits long, which is not enough to support other languages. Because a byte usually consists of 8 bits, ASCII can be extended into the 8-bit character sets of the ISO-8859 family. This is inconvenient when text is sent over the web, because the reader needs to know which encoding was used. In addition, each ISO-8859 standard limits the text to 256 characters, while other standards support many more characters and special symbols.

UTF-8 extends ASCII

Unicode text encoded as UTF-8 uses multi-byte characters, yet it is still an extension of ASCII: text that was originally in ASCII stays byte-for-byte the same in UTF-8. How? By using the fact that ASCII characters are 7 bits long. If a byte's leftmost bit is 0, it is an ASCII character.
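This compatibility is easy to check. The following is a minimal Python sketch; the sample string is arbitrary and chosen only for illustration.

```python
# ASCII text is byte-for-byte identical when it is encoded as UTF-8.
text = "Hello, world!"
ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

same_bytes = ascii_bytes == utf8_bytes
# Every byte of ASCII text has its leftmost (most significant) bit cleared.
high_bits_clear = all(byte & 0x80 == 0 for byte in utf8_bytes)

print(same_bytes, high_bits_clear)
```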

The meaning of a byte in a UTF-8 string is determined by the number of ‘1’ bits before the first ‘0’. The following table shows what the bytes mean:

Byte Value   Meaning
0xxxxxxx     Regular ASCII. Range: 00h – 7Fh
110xxxxx     First of a 2-byte character. Range: 0080h – 07FFh
1110xxxx     First of a 3-byte character. Range: 0800h – FFFFh
11110xxx     First of a 4-byte character. Range: 10000h – 10FFFFh
10xxxxxx     Trailing byte. All remaining bytes of a multi-byte character have this form.
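The table can be read as a rule: count the ‘1’ bits before the first ‘0’. Here is a minimal Python sketch of that rule. The function name `classify_byte` is made up for this example, and it looks only at the bit prefix, so it does not catch every byte value that a strict UTF-8 validator would reject.

```python
def classify_byte(byte: int) -> str:
    """Describe a UTF-8 byte by counting the 1 bits before its first 0 bit."""
    leading_ones = 0
    mask = 0x80
    while mask and byte & mask:
        leading_ones += 1
        mask >>= 1
    kinds = {
        0: "regular ASCII",
        1: "trailing byte",
        2: "first of a 2-byte character",
        3: "first of a 3-byte character",
        4: "first of a 4-byte character",
    }
    # Five or more leading 1 bits do not appear in the table.
    return kinds.get(leading_ones, "not a valid UTF-8 prefix")


# Print each byte of an encoded character next to its meaning.
for b in "♬".encode("utf-8"):
    print(f"{b:08b}  {classify_byte(b)}")
```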

Converting a Unicode value into bytes is simple:

  1. According to the value, find the range.
  2. Place the bits of the value in the available bits of each byte.

For example, the value of the musical symbol “♬” is U+266C, which is ‘0010 0110 0110 1100’ in binary.

The table shows that it is in the range 0800h – FFFFh. Thus, it is a 3-byte character.

The first byte holds the 4 bits ‘1110’ followed by the character's first 4 bits, i.e. 11100010 in binary, or E2h (0xE2).

The second byte holds ‘10’ followed by the next 6 bits, i.e. 10011001, or 99h (0x99).

The third byte holds ‘10’ followed by the last 6 bits, i.e. 10101100, or ACh (0xAC).
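The same two steps can be written as a short Python sketch. The function name `utf8_encode` is hypothetical, and the sketch is deliberately minimal: for instance, it does not reject the surrogate range, which a production encoder would. Its result is compared with Python's built-in encoder as a sanity check.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode value using the bit layout from the table."""
    # Step 1: find the range, which fixes the lead byte and trailing count.
    if code_point < 0x80:
        return bytes([code_point])
    if code_point < 0x800:
        lead, trailing = 0xC0, 1
    elif code_point < 0x10000:
        lead, trailing = 0xE0, 2
    elif code_point <= 0x10FFFF:
        lead, trailing = 0xF0, 3
    else:
        raise ValueError("value is outside the Unicode range")
    # Step 2: place the bits. The lead byte takes the highest bits,
    # and each trailing byte takes the next 6 bits after a '10' prefix.
    out = [lead | (code_point >> (6 * trailing))]
    for i in range(trailing - 1, -1, -1):
        out.append(0x80 | ((code_point >> (6 * i)) & 0x3F))
    return bytes(out)


encoded = utf8_encode(0x266C)
print(" ".join(f"{b:02X}" for b in encoded))  # E2 99 AC
print(encoded == "♬".encode("utf-8"))         # True
```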

The Bidirectional Type (Right-to-left? Left-to-right?)

Information about each Unicode character can be found in the file ‘UnicodeData.txt’. To download it, browse to the Unicode Consortium site. Following “Side menu -> The Unicode Standard -> Unicode Character Database” leads to a page with a link to the latest version of the Unicode character data.

The file ‘UnicodeData.txt’, along with others, can be found there.

This file is a reference database of character information. Each character has one line in the database, and the fields are separated by semicolons. The bidirectional class is the fifth field in the line. It can be a character direction (left-to-right, right-to-left), white space, a non-spacing mark (a mark that is displayed without advancing the position, such as Hebrew points, Arabic vowel signs, or accents above vowels), and so on.
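As an illustration, here is a minimal Python sketch that extracts the fifth field from one line. The sample line is written in the format of ‘UnicodeData.txt’ and is included only for demonstration, and the helper name `bidi_class` is made up. Python's standard `unicodedata` module exposes the same property through `unicodedata.bidirectional`, so the second part looks up a few characters without reading the file.

```python
import unicodedata

# A sample line in the semicolon-separated format described above.
SAMPLE_LINE = "05D0;HEBREW LETTER ALEF;Lo;0;R;;;;;N;;;;;"


def bidi_class(line: str):
    """Return (code point, bidirectional class) from one database line."""
    fields = line.strip().split(";")
    # The fifth field sits at index 4 because list indexes start at 0.
    return int(fields[0], 16), fields[4]


code_point, bidi = bidi_class(SAMPLE_LINE)
print(f"U+{code_point:04X} {bidi}")  # U+05D0 R

# The same property from Python's bundled copy of the database:
# a Latin letter, a Hebrew letter, a space, and a Hebrew point.
for ch in ("a", "\u05d0", " ", "\u05b0"):
    print(f"U+{ord(ch):04X} {unicodedata.bidirectional(ch)}")
```

The loop prints the classes L, R, WS and NSM, matching the kinds of values listed above.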

More information about the fields can be found on the Unicode Character Database page, section 5.3, table 9. To get there from the home page:

  1. From the ‘Quick Links’ frame, choose Specifications.
  2. From the Specifications page, choose Unicode Character Database, under ‘General’.

Table 9 contains the description of each field. The field number (that is, the semicolon after which the field appears) is given in parentheses in column 4 of the table.

The Algorithm

On the Specifications page mentioned in the previous section, under “Rendering”, you can find a link to the algorithm that converts bidirectional input text into visual order, i.e. the sequence of characters as they should appear on screen when drawn strictly left-to-right.

FriBidi is a free implementation of the algorithm, written in C; a PHP extension that wraps it is also available.