- Character: a minimal piece of information that has semantic value;
- Code point: a unique number assigned to a specific character;
- Code unit: a sequence of bits used to represent a character in a given encoding system. For ASCII a code unit is 7 bits; for UTF-8 it is 8 bits; for UTF-16 it is 16 bits, with basic Unicode code points taking one code unit and extended ones (encoded as so-called surrogate pairs) taking two, i.e. 32 bits.
A long time ago, when the computer era was just beginning to sprout, there was no language other than English. So everything programmers needed to store in a computer was numbers, punctuation marks, the English letters a-z, their uppercase variants and a bunch of non-printable symbols representing, for example, null, audible bell, backspace, horizontal tab and so on. To transfer your text to other computers, you somehow need to standardize your characters. The American Standard Code for Information Interchange (ASCII) did this by assigning each character its own number (called a code point). For all those characters, 7 bits were enough (2^7 = 128 different numbers can be represented with 7 binary digits). For instance, the ASCII code point for the escape control code is 27, uppercase C is 67, and the last-but-one number, 126, maps to ~ (the tilde sign). The ASCII code points are given in decimal, and the binary representation of each code point is exactly how the character is stored in the computer's memory. So now you can easily transfer your text data to another computer, and it will “understand” that “01000001” is “A”, “00110000” is “0” (zero) and so on.
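The ASCII mapping described above can be explored with Python's built-in `ord()` and `chr()` (a small sketch, not tied to any particular library):

```python
# ord() gives a character's code point; chr() goes the other way.
code = ord('C')
print(code)                     # 67, the ASCII code point of uppercase C
print(format(code, '07b'))      # 1000011, its 7-bit binary form
print(chr(126))                 # ~, the last-but-one ASCII character
print(format(ord('0'), '08b'))  # 00110000, how the digit zero is stored in a byte
```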
A small remark on word size (the number of bits processed by a computer's CPU in one step). It depends on the computer architecture and can actually be of any size, not only the current standards of 32 and 64 bits. Many other sizes have been used historically, including 6, 9, 12, 18, 24 and even 60. After the introduction of the IBM System/360 architecture, which standardized the 8-bit byte, it became common to choose word sizes that are multiples of 8 bits, such as 16, 32 and 64. From this point on, I will refer to 8 bits as a byte (which is the standard nowadays). And of course, every ASCII-encoded character fits in 8 bits.
So far so good, but then the Internet came, and it turned out that besides English there was a whole zoo of other languages and writing systems. No chance to squeeze them into 7 bits ☺. So the Unicode consortium did a lot of work to produce a consistent enumeration for almost all languages in the world, and even some ancient Egyptian hieroglyphs and tons of fancy symbols ⛇. Each character was assigned a code point in the format U+ (which stands for Unicode) followed by a hexadecimal number (U+0041 is A). ASCII character numbers remain the same within Unicode and fit into 1 byte. The Basic Multilingual Plane (BMP) is a subset of the Unicode character set which is sufficient for the vast majority of characters and fits in 2 bytes. All other characters need at most 4 bytes. The next question that arises is how to convert these code points, such as “U+024F”, to bits.
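In Python 3, a string is a sequence of Unicode code points, so the U+ notation above maps directly onto `ord()` and escape sequences (a sketch):

```python
# Unicode code points in Python 3: str is a sequence of code points.
print(hex(ord('A')))            # 0x41, i.e. U+0041: the same number as in ASCII
print(ord('\u024f'))            # 591, the code point U+024F mentioned above
print(hex(ord('\U0001F600')))   # 0x1f600: an emoji outside the BMP
```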
Character encoding scheme
The character encoding system (or scheme) is a fairly general term meaning an algorithm for mapping characters (or the code points assigned to them) to some other representation: bits, in our case. There are many character encoding systems; Morse code, for example, is also an encoding system, in which letters and numbers are translated into series of dots and dashes:
A . _
B _ . . .
C _ . _ .
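As a toy illustration, the Morse table above is just a lookup from characters to dot-and-dash strings. This is a minimal sketch; `to_morse` is a hypothetical helper, not part of any library:

```python
# The three-row Morse table from the text, as a Python dict.
MORSE = {'A': '.-', 'B': '-...', 'C': '-.-.'}

def to_morse(text):
    """Encode text by mapping each character to its Morse code."""
    return ' '.join(MORSE[c] for c in text.upper())

print(to_morse('abc'))   # .- -... -.-.
```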
The most common approaches which are used today to encode the Unicode numeric representation are UTF-16 and UTF-8 (Unicode Transformation Format).
UTF-16 is a variable-length encoding which can encode every Unicode character, using 16 bits (one code unit) for BMP characters or 32 bits (two code units, a surrogate pair) for all other Unicode characters.
- developed from an earlier fixed-width 16-bit encoding known as UCS-2;
- can be encoded with different endianness;
- used by Windows, C#, Qt, Java, older versions of Python.
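The code units and the endianness point can both be seen with Python's explicit big- and little-endian UTF-16 codecs (a sketch; the `-be`/`-le` variants simply skip the byte-order mark):

```python
# One BMP character is one 16-bit code unit; a non-BMP character is two.
print('A'.encode('utf-16-be'))           # b'\x00A': one code unit, 2 bytes
print(len('\U0001F600'.encode('utf-16-be')))  # 4: a surrogate pair, two code units
# The same code unit with the opposite endianness has its bytes swapped:
print('A'.encode('utf-16-le'))           # b'A\x00'
```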
UTF-8 is a variable-length encoding which uses 8 bits to encode ASCII, 16 for most European languages and for Arabic and Hebrew, 24 for the rest of the BMP and 32 for all remaining Unicode characters.
- uses the leading bits of each byte as flags that indicate whether it starts a multibyte character and how many bytes to read;
- endianness independent;
- efficient in space;
- used everywhere on the web, a preferable format for cross-platform support.
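These variable byte counts are easy to observe with Python's `str.encode` (a sketch; the sample characters are only illustrative):

```python
# UTF-8 length depends on the code point: 1 to 4 bytes per character.
for ch in ('A', '\u00e9', '\u20ac', '\U0001F600'):   # A, é, €, 😀
    encoded = ch.encode('utf-8')
    print(f'U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex()}')
```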
If your computer does not know which encoding to use to interpret (decode) “00111000100010….0110000101…”, all these bits make no sense and you cannot get any meaningful information out of them.
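Decoding with the wrong scheme can be demonstrated directly in Python (a sketch; the word 'naïve' is just an example string):

```python
# The same bytes, interpreted with two different schemes.
data = 'na\u00efve'.encode('utf-8')   # the UTF-8 bytes of 'naïve'
print(data.decode('utf-8'))           # naïve  (correct scheme)
print(data.decode('latin-1'))         # naÃ¯ve (wrong scheme: mojibake)
```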
The word “encoding” can mean several things:
– the mapping between abstract symbols and characters and standardized unique numbers (like ASCII or Unicode);
– the scheme for translating code points to a binary representation and vice versa, i.e. for correctly interpreting a series of bits (again ASCII, but also UTF-8 and UTF-16).
It can be a little confusing, but on the other hand the two meanings are tightly connected.
- ASCII is both a standard for mapping abstract characters to numbers (code points) and an encoding system for translating those code points to bits. The binary representation of an ASCII character's code point is exactly its encoding in a computer. It needs 1 byte to represent a character within its range.
- Unicode is a standard, which “provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language”. It needs 1 to 4 bytes to represent a character. There are numerous ways to encode Unicode code points into bits. Most common are UTF-8 and UTF-16.
- UTF-16 is an encoding scheme which has a 2-byte code unit and uses one or two code units to encode a Unicode character. It is endianness dependent.
- UTF-8 is an encoding scheme with a 1-byte code unit which uses one to four code units to encode a Unicode character. It is endianness independent. For ASCII-heavy text, UTF-8 is more memory-efficient than UTF-16.
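The size comparison depends on the text, which a short Python sketch makes concrete (the sample strings are arbitrary; note that for CJK text UTF-16 can actually come out smaller):

```python
# Encoded size depends on the text, not just the scheme.
for text in ('hello', '\u65e5\u672c\u8a9e'):   # 'hello' and Japanese '日本語'
    u8 = text.encode('utf-8')
    u16 = text.encode('utf-16-be')
    print(f'{text!r}: UTF-8 {len(u8)} bytes, UTF-16 {len(u16)} bytes')
```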