Software engineering notes

Encoding Schemes

Common encoding schemes

What’s Unicode?

Difference between ASCII, UTF-8 and UTF-16

How are characters stored on a server?

For example, to store “hello 世界”, each character is first mapped to its Unicode code point, then encoded (here with UTF-8) into bytes before being written to disk in binary:

104 101 108 108 111 32 228 184 150 231 149 140
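A minimal Python sketch of that step, producing the same byte values:

```python
# Encode the example string to UTF-8 and show the raw byte values.
text = "hello 世界"
encoded = text.encode("utf-8")

# Each element of a bytes object is an integer in 0..255.
print(list(encoded))
# [104, 101, 108, 108, 111, 32, 228, 184, 150, 231, 149, 140]
```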

Looking these numbers up in a UTF-8 encoding table (with the display format set to decimal) shows what each byte represents.

Using a decimal-to-hex converter, 228 184 150 becomes E4 B8 96. In the encoding table this sequence falls in the CJK Unified Ideographs block (U+4E00...U+9FFF), where you can see the character 世 represented as decimal 228 184 150.
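A small sketch confirming that lookup with Python’s built-ins (hex(), ord(), and the standard unicodedata module):

```python
import unicodedata

# The three UTF-8 bytes of 世, taken from the byte dump above.
byte_values = [228, 184, 150]

# Decimal -> hex, the same job as the online converter.
print([hex(b) for b in byte_values])   # ['0xe4', '0xb8', '0x96']

# Decoding the bytes recovers the character and its code point.
ch = bytes(byte_values).decode("utf-8")
print(ch, hex(ord(ch)), unicodedata.name(ch))
# 世 0x4e16 CJK UNIFIED IDEOGRAPH-4E16
```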

How does the decoder tell how many bytes are used to represent a single Unicode code point?

The leading bits of the first byte of a character determine whether it is encoded with one, two, three, or four bytes.

For example, if the first byte starts with 11110, the character is represented using 4 bytes; if it starts with 1110, 3 bytes; if it starts with 110, 2 bytes; and if it starts with 0, it is a single-byte (ASCII) character.
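A sketch of that rule as a helper function (utf8_sequence_length is a made-up name for illustration, not a library function):

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return how many bytes a UTF-8 sequence uses,
    based on the leading bits of its first byte."""
    if first_byte >> 7 == 0b0:       # 0xxxxxxx -> 1 byte (ASCII)
        return 1
    if first_byte >> 5 == 0b110:     # 110xxxxx -> 2 bytes
        return 2
    if first_byte >> 4 == 0b1110:    # 1110xxxx -> 3 bytes
        return 3
    if first_byte >> 3 == 0b11110:   # 11110xxx -> 4 bytes
        return 4
    raise ValueError("not a valid UTF-8 leading byte")

print(utf8_sequence_length(104))  # 1  ('h')
print(utf8_sequence_length(228))  # 3  (first byte of 世)
```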

The process of converting bytes into a character (世)

If you convert the decimal bytes 228 184 150 of 世 into hex, you get E4B896, which is not its code point. The code point of 世 is U+4E16, not U+E4B896.

How is it converted?

First, convert each decimal value to 8-bit binary:

228 = 11100100
184 = 10111000
150 = 10010110

Next, concatenate these binary values to get the 24-bit binary representation of the encoded character:

11100100 10111000 10010110

The leading bits 1110 in the first byte tell us this is a 3-byte sequence, and that the following two bytes each start with the continuation prefix 10.

Now remove the 1110 prefix from the first byte and the 10 prefix from the second and third bytes, then concatenate what remains:

0100 111000 010110

Finally, convert this binary to hex to get 4E16, the code point of the Chinese character 世.
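The whole conversion as a short Python sketch:

```python
# Manually decode the 3-byte UTF-8 sequence for 世, following the steps above.
byte_values = [228, 184, 150]

# Step 1: decimal -> 8-bit binary strings.
bits = [format(b, "08b") for b in byte_values]
print(bits)  # ['11100100', '10111000', '10010110']

# Step 2: strip the UTF-8 prefixes: 1110 from the first byte,
# 10 from each continuation byte.
payload = bits[0][4:] + bits[1][2:] + bits[2][2:]
print(payload)  # 0100111000010110

# Step 3: interpret the remaining bits as the code point.
code_point = int(payload, 2)
print(hex(code_point))  # 0x4e16
print(chr(code_point))  # 世
```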

Applying the same steps to 界 (decimal bytes 231 149 140) gives its code point U+754C.
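As a quick check, Python’s built-in UTF-8 decoder confirms this:

```python
# Decode the remaining three bytes of the dump and show the code point.
ch = bytes([231, 149, 140]).decode("utf-8")
print(ch, hex(ord(ch)))  # 界 0x754c
```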
