UTF-8

UTF-8 (Unicode Transformation Format, 8-bit) is a widely used text encoding standard that stores and transmits Unicode characters compactly, using a variable number of bytes (from 1 to 4), while remaining fully backward compatible with 7-bit ASCII. UTF-8 is standardized in RFC 3629 and in ISO/IEC 10646 Annex D. It is now the dominant encoding on the web and is also widely used in UNIX-like operating systems ^[1]. UTF-8 was developed on September 2, 1992 by Ken Thompson and Rob Pike and implemented in Plan 9 ^[2]. On Windows, its code page identifier is 65001 ^[3].

Compared with UTF-16, UTF-8 is the more compact encoding for Latin-script text: Latin letters without diacritics, digits and the most common punctuation marks each take only one byte in UTF-8, and their codes are the same as in ASCII. ^[4] ^[5]
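The size difference can be checked directly. The following is a minimal sketch using Python's built-in codecs; the helper name encoded_sizes and the sample strings are illustrative assumptions, and "utf-16-le" is chosen only so that no byte order mark is counted.

```python
def encoded_sizes(text: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 byte count) for the given text."""
    # "utf-16-le" is used so that no byte order mark is added to the count.
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


latin_sizes = encoded_sizes("Hello, world!")  # 13 ASCII characters
cyrillic_sizes = encoded_sizes("Привет")      # 6 Cyrillic letters

print(latin_sizes)     # (13, 26)
print(cyrillic_sizes)  # (12, 12)

# For pure ASCII text the UTF-8 bytes are exactly the ASCII bytes.
print("Hello".encode("utf-8") == "Hello".encode("ascii"))  # True
```

For the ASCII sample UTF-8 needs half the bytes of UTF-16, while for the Cyrillic sample both encodings use two bytes per letter.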

Contents

1 Coding Algorithm
- 1.1 Coding Examples
2 UTF-8 Marker
3 Fifth and Sixth Bytes
4 Notes

Coding Algorithm

The UTF-8 encoding algorithm is standardized in RFC 3629 and consists of three stages:

1. Determine the number of octets (bytes) required to encode the character. The character number (code point) is taken from the Unicode standard.

Character number range	Octets required
`00000000-0000007F`	1
`00000080-000007FF`	2
`00000800-0000FFFF`	3
`00010000-0010FFFF`	4

For Unicode characters with numbers from U+0000 to U+007F (which occupy one byte with a zero in the high bit), the UTF-8 encoding is identical to the 7-bit US-ASCII encoding.
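The range table of the first stage translates into a short lookup. This is a sketch only: the function name octets_required is an assumption, and it does not reject the surrogate range U+D800 to U+DFFF, which RFC 3629 also excludes.

```python
def octets_required(code_point: int) -> int:
    """Stage 1: number of UTF-8 octets needed for a Unicode code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError(f"code point out of range: {code_point:#x}")
    if code_point <= 0x7F:
        return 1  # same single byte as ASCII
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4


print(octets_required(0x24))     # 1
print(octets_required(0x20AC))   # 3
print(octets_required(0x10348))  # 4
```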

2. Set the most significant bits of the first octet in accordance with the required number of octets determined in the first stage:

0xxxxxxx - if one octet is required for encoding;
110xxxxx - if two octets are required for encoding;
1110xxxx - if three octets are required for encoding;
11110xxx - if four octets are required for encoding.

If encoding requires more than one octet, the two most significant bits of octets 2 to 4 are always set to 10₂ (10xxxxxx). This makes it easy to find the first octet of a character in a stream, because its most significant bits are never equal to 10₂.

Number of octets	Significant bits	Template
1	7	`0xxxxxxx`
2	11	`110xxxxx 10xxxxxx`
3	16	`1110xxxx 10xxxxxx 10xxxxxx`
4	21	`11110xxx 10xxxxxx 10xxxxxx 10xxxxxx`
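Because the high bits identify the role of each octet, a stream can be scanned octet by octet. The sketch below tests only the bit patterns from the table; the function name classify_octet and its return labels are assumptions, and it does not check which lead values a strict decoder would additionally reject.

```python
def classify_octet(octet: int) -> str:
    """Classify one octet by its most significant bits."""
    if (octet & 0b1000_0000) == 0b0000_0000:
        return "single"        # 0xxxxxxx
    if (octet & 0b1100_0000) == 0b1000_0000:
        return "continuation"  # 10xxxxxx
    if (octet & 0b1110_0000) == 0b1100_0000:
        return "lead-2"        # 110xxxxx
    if (octet & 0b1111_0000) == 0b1110_0000:
        return "lead-3"        # 1110xxxx
    if (octet & 0b1111_1000) == 0b1111_0000:
        return "lead-4"        # 11110xxx
    return "invalid"           # 11111xxx matches no template


# The three octets of the euro sign: E2 82 AC
print([classify_octet(octet) for octet in bytes.fromhex("E282AC")])
# ['lead-3', 'continuation', 'continuation']
```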

3. Set the significant bits of the octets according to the Unicode character number, expressed in binary. Start with the low-order bits of the character number and place them in the low-order bits of the last octet, then continue from right to left up to the first octet. Fill any free bits of the first octet that remain unused with zeros.
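The three stages can be put together in a small encoder. This is an illustrative sketch, not a reference implementation: the name utf8_encode is an assumption, and surrogate code points (U+D800 to U+DFFF) are not rejected here.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point following the three stages."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError(f"code point out of range: {code_point:#x}")

    # Stage 1: number of octets; stage 2: marker bits of the first octet.
    if code_point <= 0x7F:
        return bytes([code_point])  # 0xxxxxxx, identical to ASCII
    if code_point <= 0x7FF:
        count, lead_marker = 2, 0b1100_0000
    elif code_point <= 0xFFFF:
        count, lead_marker = 3, 0b1110_0000
    else:
        count, lead_marker = 4, 0b1111_0000

    # Stage 3: fill from right to left, six bits per continuation octet.
    octets = []
    remaining = code_point
    for _ in range(count - 1):
        octets.append(0b1000_0000 | (remaining & 0b0011_1111))
        remaining >>= 6
    # The bits left over go into the first octet; unused bits stay zero.
    octets.append(lead_marker | remaining)
    return bytes(reversed(octets))


print(utf8_encode(0x20AC).hex(" ").upper())   # E2 82 AC
print(utf8_encode(0x10348).hex(" ").upper())  # F0 90 8D 88
```

The octets are collected last to first and reversed at the end, which mirrors the right-to-left filling order described in stage 3.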

Coding Examples

Character	Code point	Code point in binary	UTF-8 in binary	UTF-8 in hexadecimal
$	`U+0024`	`100100`	`0 0100100`	`24`
¢	`U+00A2`	`10 100010`	`110 00010 10 100010`	`C2 A2`
€	`U+20AC`	`10 0000 10 101100`	`1110 0010 10 000010 10 101100`	`E2 82 AC`
𐍈	`U+10348`	`1 0000 0011 01 001000`	`11110 000 10 010000 10 001101 10 001000`	`F0 90 8D 88`
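The rows of this table can be reproduced with Python's built-in UTF-8 codec. The helper name utf8_forms is an assumption for illustration; note that the binary output is grouped per octet rather than split at the marker bits as in the table.

```python
def utf8_forms(character: str) -> tuple[str, str]:
    """Return the UTF-8 encoding of a character as (binary, hexadecimal)."""
    encoded = character.encode("utf-8")
    binary = " ".join(f"{octet:08b}" for octet in encoded)
    hexadecimal = encoded.hex(" ").upper()
    return binary, hexadecimal


for character in "$¢€𐍈":
    binary, hexadecimal = utf8_forms(character)
    print(f"U+{ord(character):04X}", binary, hexadecimal)
# U+0024 00100100 24
# U+00A2 11000010 10100010 C2 A2
# U+20AC 11100010 10000010 10101100 E2 82 AC
# U+10348 11110000 10010000 10001101 10001000 F0 90 8D 88
```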

UTF-8 Marker

To indicate that a file or stream contains Unicode characters, a byte order mark (BOM) can be inserted at its beginning. In UTF-8 the mark takes the form of three bytes: EF BB BF₁₆.

	1st byte	2nd byte	3rd byte
Binary code	`1110 1111`	`1011 1011`	`1011 1111`
Hexadecimal code	`EF`	`BB`	`BF`
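A reader of UTF-8 data often needs to detect and skip this marker. The sketch below relies on the constant codecs.BOM_UTF8 and the "utf-8-sig" codec from the Python standard library; the function name strip_utf8_bom is an assumption. The three bytes are the UTF-8 encoding of the code point U+FEFF.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 byte order mark, if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


with_bom = codecs.BOM_UTF8 + "text".encode("utf-8")

print(codecs.BOM_UTF8.hex(" ").upper())  # EF BB BF
print(strip_utf8_bom(with_bom))          # b'text'
print(with_bom.decode("utf-8-sig"))      # text (the codec drops the BOM)
```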

Fifth and Sixth Bytes

Originally, UTF-8 allowed up to six bytes for encoding one character. In November 2003, however, RFC 3629 prohibited the use of the fifth and sixth bytes and limited the range of encodable characters to U+10FFFF. This was done to ensure compatibility with UTF-16.
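The link to UTF-16 can be illustrated with a little arithmetic. The surrogate ranges used below (D800 to DBFF and DC00 to DFFF) come from the UTF-16 definition, not from this article, and the helper name is_decodable and the sample byte sequences are assumptions for illustration.

```python
# Largest code point reachable by a UTF-16 surrogate pair:
# 0x10000 plus ten bits from the high surrogate and ten from the low one.
highest_surrogate_pair = 0x10000 + ((0xDBFF - 0xD800) << 10) + (0xDFFF - 0xDC00)
print(hex(highest_surrogate_pair))  # 0x10ffff


def is_decodable(data: bytes) -> bool:
    """Report whether Python's UTF-8 decoder accepts the byte string."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


highest_valid = chr(0x10FFFF).encode("utf-8")           # F4 8F BF BF
five_byte_form = bytes([0xF8, 0x88, 0x80, 0x80, 0x80])  # pre-2003 style sequence

print(is_decodable(highest_valid))   # True
print(is_decodable(five_byte_form))  # False
```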

Notes

[1] Usage Statistics of Character Encodings for Websites, June 2011.
[2] http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt
[3] Code Page Identifiers - Windows applications | Microsoft Docs.
[4] Well, I'm Back. String Theory. Robert O'Callahan (March 1, 2008). Retrieved March 1, 2008. Archived August 23, 2011.
[5] Rostislav Chebykin. All encodings are encoding. UTF‑8: modern, competent, convenient. HTML and CSS. Retrieved March 22, 2009. Archived August 23, 2011.
