Common encoding schemes
- ASCII (American Standard Code for Information Interchange)
- a character encoding standard used to represent text in computers and other devices that use text
- UTF-8
- the dominant character encoding for the web, used to represent a wide array of characters from various languages
- a variable-width encoding able to encode all 1,112,064 valid character code points in Unicode using one to four 8-bit bytes
- UTF-16
- Like UTF-8, this is a variable-width encoding, but it uses 16-bit units
- It’s used predominantly for native encoding in Microsoft products
- ISO-8859-1
- Also known as Latin-1, it is a single-byte encoding that can represent the first 256 Unicode characters.
- Base62
- It is a binary-to-text encoding that represents binary data in an ASCII string format
- Consists of 0-9, a-z, and A-Z
- This is a case-sensitive encoding.
- a good choice for URL shortening services, as it doesn’t use any special characters that would need to be escaped in a URL
- Base64
- This is used to encode binary data, particularly when that data needs to be sent over email.
- Base64 takes binary data and turns it into text so it can be more easily transmitted in scenarios that can’t handle binary.
- Consists of 0-9, a-z, A-Z, +, and /, with = used for padding
- It is not URL-safe because the characters +, /, and = have special meanings in URLs
- URL Encoding (Percent Encoding)
- It replaces characters that are reserved or not allowed in a URL with % followed by the two-digit hex value of each byte, so the URL can be transmitted safely over the Internet.
- Hexadecimal Encoding
- Often used in programming, it encodes binary data as a string of hexadecimal digits.
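To make the binary-to-text schemes above concrete, here is a minimal Python sketch using only the standard library. The sample bytes are arbitrary values picked so that + and / show up in the output. Python has no built-in Base62, so `base62_encode` is a hypothetical helper, and its alphabet order (digits, then lowercase, then uppercase) is an assumption; real services may order it differently.

```python
import base64
import string
from urllib.parse import quote, unquote

# Arbitrary sample bytes, chosen so the Base64 output contains '+' and '/'.
data = bytes([0xFB, 0xFF, 0xFE])

std = base64.b64encode(data).decode("ascii")               # standard alphabet
url_safe = base64.urlsafe_b64encode(data).decode("ascii")  # '-' and '_' replace '+' and '/'
padded = base64.b64encode(b"hi").decode("ascii")           # 2 input bytes need one '=' of padding

# Percent encoding: non-ASCII characters become %XX for each UTF-8 byte.
percent = quote("hello 世界")

# Hexadecimal encoding: two hex digits per byte.
hexed = "hello".encode("utf-8").hex()

# Hypothetical Base62 helper for non-negative integers (e.g. a short-URL ID).
# The alphabet order below is an assumption, not a standard.
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase

def base62_encode(n: int) -> str:
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, remainder = divmod(n, 62)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))

print(std)                # +//+
print(url_safe)           # -__-
print(padded)             # aGk=
print(percent)            # hello%20%E4%B8%96%E7%95%8C
print(unquote(percent))   # hello 世界
print(hexed)              # 68656c6c6f
print(base62_encode(62))  # 10
```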
What’s Unicode?
- Unicode is the standard for character encoding
- Unicode is a map of characters to numbers, and encoding schemes like UTF-8, UTF-16, and UTF-32 specify how to represent those numbers in binary.
- For example, UTF-8 is one specific encoding of Unicode: A -> code point U+0041 (hex 41) -> binary 01000001
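As a small illustration of the split between code point and encoding, this Python sketch looks up the code point of A and then encodes it three ways. The big-endian codec names (`utf-16-be`, `utf-32-be`) are chosen here only so that no byte-order mark is added to the output.

```python
ch = "A"

code_point = ord(ch)             # Unicode maps the character to a number
label = f"U+{code_point:04X}"    # conventional U+XXXX notation

# Each encoding scheme turns that same number into different bytes.
utf8_hex = ch.encode("utf-8").hex()
utf16_hex = ch.encode("utf-16-be").hex()
utf32_hex = ch.encode("utf-32-be").hex()

print(code_point)   # 65
print(label)        # U+0041
print(utf8_hex)     # 41
print(utf16_hex)    # 0041
print(utf32_hex)    # 00000041
print(format(code_point, "08b"))  # 01000001
```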
Difference between ASCII, UTF-8 and UTF-16
- ASCII is a 7-bit encoding standard that can only represent 128 characters, including the standard English alphabet, numbers, and basic punctuation
- UTF-8 is a variable-length encoding scheme that uses a minimum of 8 bits to represent each character and can represent up to 1,112,064 unique characters.
- This means that characters in the ASCII range (0-127) can be represented using a single byte, while other characters may use up to 4 bytes.
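- UTF-16 uses a minimum of 16 bits to represent each character. It is also a variable-width encoding: code points from U+0000 to U+FFFF use one 16-bit unit (2 bytes), and all other code points use a surrogate pair of two units (4 bytes).
- This means most commonly used characters take 2 bytes, but code that processes UTF-16 strings still has to handle surrogate pairs, and ASCII-heavy text is less space-efficient than in UTF-8.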
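To compare the byte counts directly, the following Python sketch encodes four sample characters (chosen here purely for illustration) in UTF-8 and UTF-16 and checks whether ASCII can represent them. `byte_lengths` and `fits_in_ascii` are hypothetical helper names.

```python
# Sample characters: A, é, 世, and an emoji outside the Basic Multilingual Plane.
samples = ["A", "\u00e9", "\u4e16", "\U0001F600"]

def byte_lengths(ch: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 byte count) for one character."""
    # utf-16-le is used so that no byte-order mark is counted.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))

def fits_in_ascii(ch: str) -> bool:
    """ASCII only covers code points 0-127."""
    try:
        ch.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False

for ch in samples:
    utf8_len, utf16_len = byte_lengths(ch)
    print(f"U+{ord(ch):04X}  ascii: {fits_in_ascii(ch)}  "
          f"utf-8: {utf8_len} bytes  utf-16: {utf16_len} bytes")
```

In this run, UTF-8 needs 1, 2, 3, and 4 bytes for the four samples, while UTF-16 needs 2, 2, 2, and 4.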
How are characters stored in the server?
For example, to store “hello 世界”, each character is first mapped to its Unicode code point, and the code points are then encoded (here with UTF-8) into bytes before being written to memory or disk in binary. The resulting bytes, in decimal, are:
104 101 108 108 111 32 228 184 150 231 149 140
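To see what each number represents, open a UTF-8 encoding table and choose decimal as the display format for the UTF-8 encoding. For the single-byte values, a decimal-to-hex converter gives the code point directly. The byte sequences map to code points and characters as follows: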
104 (U+0068) -> h
101 (U+0065) -> e
108 (U+006C) -> l
108 (U+006C) -> l
111 (U+006F) -> o
32 (U+0020) -> (space)
228 184 150 (U+4E16) -> 世
231 149 140 (U+754C) -> 界
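In the table, choose the page listed as U+4DC0 ... U+4DFF Yijing Hexagram Symbols, and you can see 世 (U+4E16, which itself belongs to the CJK Unified Ideographs block) represented as decimal 228 184 150.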
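The byte values listed above can be reproduced without a lookup table. This is a minimal Python sketch; the variable names are only illustrative.

```python
text = "hello 世界"

# Encoding turns the string's code points into UTF-8 bytes.
encoded = text.encode("utf-8")
decimal_bytes = list(encoded)
print(decimal_bytes)
# [104, 101, 108, 108, 111, 32, 228, 184, 150, 231, 149, 140]

# Per character: code point and the bytes UTF-8 uses for it.
for ch in text:
    print(f"{list(ch.encode('utf-8'))} (U+{ord(ch):04X}) -> {ch!r}")

# Decoding the same bytes gives the original string back.
print(encoded.decode("utf-8") == text)  # True
```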
How does decoding tell how many bytes are used to represent a single Unicode code point?
The leading bits of the first byte of a character determine how many bytes the character uses, from one to four.
For example, if the first byte starts with 11110, the character is represented using 4 bytes.
If the first byte starts with 110, the character is represented using 2 bytes.
Likewise, a first byte starting with 1110 indicates 3 bytes, a first byte starting with 0 is a single-byte (ASCII) character, and every continuation byte starts with 10.
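The leading-bit rule can be sketched as a small Python function. `utf8_sequence_length` is a hypothetical name, and this version only inspects the bit prefix; a real decoder also rejects some lead bytes that match these prefixes but are still invalid.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return how many bytes a UTF-8 character uses, judged by its first byte."""
    if first_byte >> 7 == 0b0:       # 0xxxxxxx: single-byte (ASCII) character
        return 1
    if first_byte >> 5 == 0b110:     # 110xxxxx: 2-byte sequence
        return 2
    if first_byte >> 4 == 0b1110:    # 1110xxxx: 3-byte sequence
        return 3
    if first_byte >> 3 == 0b11110:   # 11110xxx: 4-byte sequence
        return 4
    # 10xxxxxx is a continuation byte and cannot start a character.
    raise ValueError(f"not a valid UTF-8 lead byte: {first_byte:#04x}")

print(utf8_sequence_length(104))   # 1  (h)
print(utf8_sequence_length(228))   # 3  (first byte of 世)
print(utf8_sequence_length(0xF0))  # 4
```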
The process of converting bytes into a character (世)
If you convert the decimal bytes 228 184 150 of 世 into hex, you get E4 B8 96, which is not its code point.
The code point of 世 is U+4E16, not U+E4B896.
How is it converted?
First, convert each decimal byte to binary:
228 = 11100100
184 = 10111000
150 = 10010110
Next, concatenate these bytes to get the 24-bit UTF-8 encoded form of the character (this is not yet the code point):
11100100 10111000 10010110
Then, the leading bits 1110 of the first byte tell us that this is a 3-byte sequence and that each of the following 2 bytes starts with 10.
Now remove 1110 from the first byte and 10 from the second and third bytes, then concatenate the remaining bits:
0100 111000 010110
Finally, regroup these 16 bits into groups of four (0100 1110 0001 0110) and convert to hex to get 4E16, the code point of the Chinese character 世.
Applying the same steps to 界 gives its code point, 754C.
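The same steps can be written with bit operations. This is a minimal Python sketch for the 3-byte case only; `decode_three_byte` is a hypothetical helper, and in practice `bytes.decode("utf-8")` does this work for sequences of every length.

```python
def decode_three_byte(b1: int, b2: int, b3: int) -> int:
    """Recover the code point from a 3-byte UTF-8 sequence."""
    # Check the marker bits: 1110 on the first byte, 10 on the other two.
    if b1 >> 4 != 0b1110 or b2 >> 6 != 0b10 or b3 >> 6 != 0b10:
        raise ValueError("not a 3-byte UTF-8 sequence")
    # Strip the markers (keep 4 + 6 + 6 payload bits) and concatenate them.
    return ((b1 & 0b00001111) << 12) | ((b2 & 0b00111111) << 6) | (b3 & 0b00111111)

shi = decode_three_byte(228, 184, 150)
jie = decode_three_byte(231, 149, 140)

print(f"U+{shi:04X}", chr(shi))  # U+4E16 世
print(f"U+{jie:04X}", chr(jie))  # U+754C 界

# Cross-check against Python's built-in decoder.
print(bytes([228, 184, 150]).decode("utf-8") == chr(shi))  # True
```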